Chapter 1

Supercomputing and Parallelism 


The Connection Machine system CM-5 provides high performance plus ease of 
use for large, complex, data-intensive applications. Its architecture is designed 
to scale to teraflops or teraops performance for terabyte-size problems. It 
features 

■ independent scalability of processing, communication, and I/O 

■ extremely high floating-point and integer execution rates 

■ high processor-memory bandwidth 

■ efficient execution of high-level languages

■ multiple job execution, both timeshared and partitioned 

■ multi-user network access 

■ security between users 

■ flexible high-bandwidth I/O 

■ balanced scalar and parallel execution 

■ balanced I/O, processing, and memory

■ high reliability and high availability

The CM-5 continues and extends support for the parallel programming model 
that has proved so successful in the CM-2 and CM-200. To achieve its goals, the 
CM-5 takes advantage of the latest developments in high-speed VLSI, new
compiling technologies, RISC microprocessors, operating systems, and networking.
It combines the best features of existing parallel architectures (including
fine- and coarse-grained concurrency, MIMD and SIMD control, and fault tolerance)
in a single, integrated, “universal” architecture.
1.1 Parallelism

One of the most notable advances in computing technology over the past decade 
has been in the use of parallelism, or concurrent processing, in high-performance 
computing. Of the many types of parallelism, two are most frequently cited as 
important to modern programming:

■ control parallelism, which allows two or more operations to be performed 
simultaneously. (Two well-known types of control parallelism are 
pipelining, in which different processors, or groups of processors, operate 
simultaneously on consecutive stages of a program, and functional 
parallelism, in which different functions are handled simultaneously by 
different parts of the computer. One part of the system, for example, may 
execute an I/O instruction while another does computation, or separate 
addition and multiplication units may operate concurrently. Functional 
parallelism frequently is handled in the hardware; programmers need take 
no special actions to invoke it.) 

■ data parallelism, in which more or less the same operation is performed 
on many data elements by many processors simultaneously. 

While both control and data parallelism can be used to advantage, in practice the 
greatest rewards have come from data parallelism. There are two reasons for this. 

First, data parallelism offers the highest potential for concurrency. Each type of 
parallelism is limited by the number of items that allow concurrent handling: the 
number of steps that can be pipelined before dependencies come into play, the 
number of different functions to be performed, the number of data items to be 
handled. Since in practice the last of these three limits is almost inevitably the 
highest (being frequently in the thousands, millions, or more), and since data
parallelism exploits parallelism in proportion to the quantity of data involved, the
largest performance gains can be achieved by this technique. 

Second, data parallel code is easier to write, understand, and debug than control 
parallel code. 

The reasons for this are straightforward. Data parallel languages (such as the 
Connection Machine system’s CM Fortran, C*, and *Lisp) are nearly identical 
to standard serial programming languages. Each provides some method for 
defining parallel data structures: CM Fortran uses the Fortran 90 array features, 
while the other two languages add a new data type. Once the data sets (arrays, 
matrices, structures, etc.) are defined, a single sequence of instructions, as in 
serial code, causes operations to be performed concurrently either on the full data 
sets or on selected sections thereof. Very little new syntax is added: the power 
of parallelism arises simply from extending the meaning of existing program
syntax when applied to parallel data. 

The flow of control in a data parallel language is also nearly identical to that of 
its serial counterpart. Since this control flow, rather than processor speed,
determines the order of execution, race conditions and deadlock cannot develop. The
programmer does not have to add extra code to ensure synchronization within a 
program; the compilers and other system software maintain synchronization 
automatically. Moreover, the order of events, being essentially identical to that 
in a serial program, is always known to the programmer, which eases debugging 
and program analysis considerably. 


1.2 Parallel Programming 

Prior to the CM-5, the most successful implementation of the data parallel 
programming model was the so-called SIMD (Single Instruction, Multiple Data)
architecture. As implemented on the Connection Machine models CM-2 and 
CM-200, the SIMD architecture has shown itself to be extremely efficient and 
powerful. Arrays that are hundreds or thousands of elements in size are laid out 
across hundreds or thousands of processors, one element per processor, in a 
format whose logical structure matches that of the data set itself and the 
operations to be performed on it. (See Figure 1.) When there are more array 
elements than processors, the processors subdivide themselves into “virtual 
processors” and give each element its own virtual processor. Instructions are then 
executed upon each element simultaneously. For example, given three 400 x 400
arrays A, B, and C, the statement C = A + B is a single statement, and is
executed as such, in data parallel programming.

But “data parallel” and “SIMD” are not necessarily synonymous terms. Consider, 
for example, finite difference codes. Boundary elements in these codes usually 
require special treatment, which means conditional branching. In data parallel 
languages, such branching is frequently coded along the lines of 

    where (boundary_elements)
        do_a
    elsewhere
        do_b
    end where
Figure 1. Examples of data sets.

Some problems involve data sets organized as multidimensional grids. The calculation for each data
point relies on the values of neighboring data pieces. The pattern of interaction is both local and
regular. Finite difference methods are typical of this category.

Other problems, exemplified by finite element methods, operate on data that is less rigidly structured.
The calculation for each data point again relies on the values of nearby data points, but the pattern
of interaction is irregular. In some cases the pattern of interaction may change over time, as dictated
by the content of the data (for example, to make the mesh finer in regions of interest).

For tasks such as sorting, the manner in which data points interact depends greatly on the data values;
the pattern of communication will be both nonlocal and irregular.

The communications networks of the CM-5 are designed to support both regular and irregular
patterns of communication. Patterns that are predominantly local are rewarded with higher throughput.
A pure SIMD implementation of such code will execute the where branch for all
boundary elements, and then execute the elsewhere branch for all interior
elements. A MIMD (Multiple Instructions, Multiple Data) implementation will
execute both branches simultaneously, with each processor making its own 
decision whether to fetch and execute instructions for the where branch or for 
the elsewhere branch for each element. When all processors have finished 
execution, the program will proceed to the next statement. Note that both 
implementations use the same code; both are undoubtedly data parallel 
programming. The only externally visible difference will be in performance; the 
second implementation, by using functional parallelism in support of data 
parallelism, can run faster than the first. 

Note that the order of events in either case is identical to the order that would 
obtain for serial code. Even if the where branch takes several times as long as 
the elsewhere branch to execute, no processors will proceed to the subsequent 
statement until all have finished executing the where block. System software 
implements this control; the programmer does not have to worry about it. Only 
where events have no dependencies on each other, so that their order does not 
matter, will the order be unknown. (Figure 2 illustrates the combined
independence and synchronization of program execution in this MIMD implementation
of the data parallel programming model.)
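The two execution strategies for the where/elsewhere construct can be sketched in ordinary serial Python. This is only an emulation of the semantics, not CM-5 code: the function names, the sample grid, and the two branch bodies (hold_fixed standing in for do_a, halve for do_b) are hypothetical. It shows that both strategies produce the same result from the same program text.

```python
def where_two_pass(mask, data, do_a, do_b):
    """SIMD-style: run the where branch for every selected element,
    then the elsewhere branch for every remaining element."""
    out = list(data)
    for i, selected in enumerate(mask):      # pass 1: where branch
        if selected:
            out[i] = do_a(data[i])
    for i, selected in enumerate(mask):      # pass 2: elsewhere branch
        if not selected:
            out[i] = do_b(data[i])
    return out


def where_per_element(mask, data, do_a, do_b):
    """MIMD-style: each element takes its own branch; the result is
    available only once every element has finished."""
    return [do_a(x) if selected else do_b(x) for selected, x in zip(mask, data)]


def hold_fixed(x):
    """Hypothetical do_a: boundary values are left unchanged."""
    return x


def halve(x):
    """Hypothetical do_b: interior values are halved."""
    return 0.5 * x


grid = [4.0, 8.0, 6.0, 2.0, 10.0]
boundary = [i in (0, len(grid) - 1) for i in range(len(grid))]

simd_result = where_two_pass(boundary, grid, hold_fixed, halve)
mimd_result = where_per_element(boundary, grid, hold_fixed, halve)
```

Because neither branch reads a value written by the other, the order in which elements are processed does not affect the outcome; only the elapsed time would differ on real hardware.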


Extensions to the Data Parallel Model 

Although data parallel programming provides the biggest gains among known 
techniques of parallelism, it may sometimes be usefully extended by mixing in 
other parallel techniques. For example, some applications may perform best 
when divided into sections, each section making use of data parallel programming
and all sections together acting as a pipeline. Thus, one process might
gather data and do some preliminary selecting or compacting; it would then pass 
its results to a second process, which would do more intense computing on the 
smaller data set; and that process would then pass its results to a third process, 
which would perform some visualization or reporting function. On the CM-5, all 
three processes can run in parallel, either timesharing on a single partition or 
perhaps each having exclusive use of a separate partition. In the latter case, each 
process has its own physical computing resources; I/O for the first process and 
computation for the second occur simultaneously, with no impact on each other 
or on the third process. 
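The three-stage arrangement described above can be sketched as a chain of Python generators. This is a sequential illustration of the data flow only; on a CM-5 the stages would be separate processes running concurrently, which this sketch does not attempt to model. The stage names, the selection rule (keep non-negative readings), and the computation (squaring) are hypothetical.

```python
def select(raw):
    """Stage 1: gather data and discard readings that are not needed."""
    for reading in raw:
        if reading >= 0:
            yield reading


def compute(values):
    """Stage 2: do the heavier computation on the smaller data set."""
    for v in values:
        yield v * v


def report(results):
    """Stage 3: summarize the results for reporting."""
    results = list(results)
    return {"count": len(results), "total": sum(results)}


summary = report(compute(select([3, -1, 4, -1, 5])))
```

Each stage consumes the output of the one before it, so a later stage can begin work as soon as the first items arrive rather than waiting for the whole data set.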
Figure 2. Code running on a CM-5.

[Diagram labels: Code; Data; Independent Computation (no synchronization needed); Cooperative Computation (synchronization needed).]

A partition manager loads identical code onto every processing node in a partition. Data is distributed 
across the nodes: Given an array of m values and a partition of n nodes, each node handles m/n values. 

Each node executes its program independently, branching according to its own data values. As long 
as computation remains local, no synchronization or communication is needed. 

When data needs to be transferred among processors (for example, when processors must each
contribute values to a global sum), the communications networks carry the data and enforce the
necessary synchronization. (For global combining operations such as sum, the Control Network performs
the reduction.)
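The distribution of m values over n nodes, followed by a global sum, can be sketched as below. This is a serial Python emulation under assumptions: the contiguous block layout and the function names distribute and global_sum are hypothetical, and the final combining step stands in for what the text says the Control Network does in hardware.

```python
def distribute(values, n_nodes):
    """Split values into n_nodes contiguous blocks of nearly equal size,
    so each node handles about m/n values."""
    base, extra = divmod(len(values), n_nodes)
    blocks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        blocks.append(values[start:start + size])
        start += size
    return blocks


def global_sum(blocks):
    """Local partial sums need no communication; only the final
    combining step involves every node."""
    partial = [sum(block) for block in blocks]   # independent computation
    return sum(partial)                          # cooperative computation


values = list(range(1, 17))          # m = 16 values
blocks = distribute(values, 4)       # n = 4 nodes
total = global_sum(blocks)
```

The partial sums correspond to the independent phase of Figure 2, and the single combining step corresponds to the cooperative phase.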
The CM-5 thus extends the data parallel programming model developed for the
CM-2 and CM-200 to incorporate an even broader and more widely useful mix of
parallel techniques. Optimized for data parallelism, the CM-5 nonetheless
supports other forms of parallelism that can either enhance data parallelism or allow
the porting of programs from other architectures. This extended model, which we
may call coordinated parallelism, represents the best that is known about parallel
programming today.


1.3 Advantages of a Universal Architecture 

In the past, programmers of supercomputers were forced to choose between 
MIMD machines, which were good at independent branching but bad at
synchronization and communication, and SIMD machines, which were good at
synchronization and communication but poor at branching. The CM-5 supports
the full data parallel model by providing high performance for branching and
synchronization alike and, indeed, for all aspects of both SIMD-style and
MIMD-style architectures.


Scalable Computing 

The CM-5 is the first architecture to offer truly scalable computing. It does this 
by combining its universal architecture with completely scalable hardware and 
scalable programming models. An application that runs on a small CM-5 can be 
run without change on a larger CM-5, and will see its performance increase 
accordingly. That same application may also be run on a workstation, mainframe, 
or shared-memory multiprocessor. 

Figure 3 shows some of the ways in which applications originally written for 
other systems can run on the CM-5. (The illustration is based on the Fortran
language, but the CM-5 supports C and Lisp as well.)

■ Existing CM-2 and CM-200 Fortran programs can be moved directly onto 
the CM-5; recompiling is all that is needed. 

* In some cases, partial recoding of CM-2 and CM-200 programs can bring 
better performance by taking advantage of new compiler features. 
Figure 3. Transporting programs to the CM-5.

CM Fortran programs written for the CM-2 and CM-200, Fortran 77 programs written for
execution on serial computers, and message-passing programs designed to run on MIMD-only
architectures are all easily ported to the universal architecture of the CM-5.


November 1993 

Copyright © 1993 Thinking Machines Corporation 














■ Applications written using a message-passing programming model for 
distributed memory computers can run on the CM-5 by substituting calls 
to a CM-5 message-passing library for the original calls. 

* With some additional recoding, message-passing programs can be tuned 
to take advantage of the superior hardware facilities for cooperative
computation offered by the CM-5.

■ Existing Fortran 77 codes can be migrated to Fortran 90, using CMAX, and 
then compiled by the CM Fortran compiler. This allows many widely used 
codes to function effectively on the CM-5. 


1.4 Looking Ahead 

The next two chapters explain further what coordinated parallelism on the CM-5 
offers. Chapter 2 shows how the CM-5 hardware is optimized to support
coordinated parallelism, while Chapter 3 provides further explanation of the features
to be found in data parallel languages. 
Chapter 2

The Basic Components of the CM-5


At its best, parallel processing brings many processors, working in close
coordination, to bear on large quantities of data. An effective parallel-processing
system must provide a large amount of memory to hold this data and must
provide effective access to the data for hundreds or thousands of processors. The
CM-5 system meets this goal. Moreover, it allows its memory and processor 
resources to be applied equally effectively to a single large problem or to job 
requests from dozens of simultaneous users. 

Traditional computer architectures, such as the generic system diagrammed in 
Figure 4, link one or a few processors to a shared memory via a system bus. This 
worked well when processing speeds were slower and the number of processors 
was small. Nowadays it is much more cost-effective to use many processors than 
to try to make the processors faster. With many processors, a simple bus is a 
bottleneck, and the complex switches that can provide fast access to a shared 
memory for every memory reference are both expensive and complicated. Two 
more changes to the early model are therefore needed to balance communication 
speed with processing speed: memory must be distributed, rather than shared; 
and a high-bandwidth network, rather than a bus, must be used. Figure 5
diagrams this second architecture as it appears in the CM-5.


2.1 Processors 

A CM-5 system may contain tens, hundreds, or thousands of parallel processing 
nodes. Each node has its own memory. Nodes can fetch from the same address 
in their respective memories to execute the same (SIMD-style) instruction, or 
from individually chosen addresses to execute independent (MIMD-style) 
instructions. 
The processing nodes are supervised by a control processor, which runs an
enhanced version of the UNIX operating system. Program loading begins on the
control processor; it broadcasts blocks of instructions to the parallel processing
nodes and then initiates execution. When all nodes are operating on a single
control thread, the processing nodes are kept closely synchronized and blocks are
broadcast as needed. When the nodes take different branches, they fetch
instructions independently and synchronize only as required by the algorithm under
program control.

To maximize system usefulness, a system administrator may divide the parallel 
processing nodes into groups, known as partitions. There is a separate control 
processor, known as a partition manager, for each partition. Each user process 
executes on a single partition, but may exchange data with processes on other 
partitions. Since all partitions utilize UNIX timesharing and security features, 
each allows multiple users to access the partition while ensuring that no user’s 
program interferes with another’s. 

Other control processors in the CM-5 system manage the system’s I/O devices
and interfaces. This organization allows a process in any partition to access any
I/O device, and ensures that access to one device does not impede access to other
devices. (Figure 6 shows how this distributed control works with the CM-5’s
interprocessor communication networks to enhance system efficiency.)

Figure 5. Organization of the Connection Machine system.


2.2 Networks 

Every control processor and parallel processing node in the CM-5 is connected
to two scalable interprocessor communication networks, designed to give low
latency combined with high bandwidth in any possible configuration a user may
wish to apply to a problem. Any node may present information, tagged with its
logical destination, for delivery via an optimal route. The network design
provides low latency for transmissions to near neighboring addresses, while
preserving a high, predictable bandwidth for more distant communications.

The two interprocessor communications networks are the Data Network and the
Control Network. In general, the Control Network is used for operations that
involve all the nodes at once, such as synchronization operations and broadcasting;
the Data Network is used for bulk data transfers where each item has a single
source and destination.

Figure 6. Distributed control on the CM-5.

Functionally, the CM-5 is divided into three major areas. The first contains some number of partitions,
which manage and execute user applications; the second contains some number of I/O devices and
interfaces; and the third contains the two interprocessor communications networks that connect all
parts of the first two areas. (A fourth functional area, covering system management and diagnostics,
is handled by a third interprocessor network and is not shown in this drawing.)

Because all areas of the system are connected by the Data Network and the Control Network, all can
exchange information efficiently. The two networks provide high bandwidth transfer of messages of
all sorts: downloading code from a control processor to its nodes, passing I/O requests and
acknowledgments between control processors, and transferring data, either among nodes (whether in a single
partition or in different partitions) or between nodes and I/O devices.

A third network, the Diagnostics Network, is visible only to the system 
administrator; it keeps tabs on the physical well-being of the system. 

External networks, such as Ethernet and FDDI, may also be connected to a CM-5 
system via the control processors. 


2.3 I/O 

The CM-5 runs a UNIX-based operating system; it provides its own high-speed 
parallel file system, and also allows full access to ordinary NFS file systems. It 
supports both HIPPI (High-Performance Parallel Interface) and VME interfaces, 
thus allowing connections to a wide range of computers and I/O devices, while 
using standard UNIX commands and programming techniques throughout. A
CMIO interface supports mass storage devices such as the DataVault and enables
sharing of data with CM-2 and CM-200 systems. 

I/O capacity may be scaled independently of the number of computational 
processors. A CM-5 system of any size can have the I/O capacity it needs, 
whether that be measured in local storage, in bandwidth, or in access to a variety 
of remote data sources. Communications capacity scales both with processors 
and with I/O. Customers may choose both the processing power and the I/O
capabilities that meet their needs, and the CM’s communications capacity is
automatically scaled to match.

Just as every partition is managed by a control processor, every I/O device is 
managed by an input/output control processor (IOCP), which provides the
software that supports the file system, device driver, and communications protocols.
Like partitions, I/O devices and interfaces use the Data Network and the Control 
Network to communicate with processes running in other parts of the machine. 
If greater bandwidth is desired, files can be spread across multiple I/O devices: 
a striped set of eight DataVaults, for example, can provide eight times the I/O 
bandwidth of a single DataVault. 
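The bandwidth gain from striping can be illustrated with a small model. The sketch below assumes round-robin placement of blocks and an idealized device that transfers one block per time step; neither assumption comes from the text, and the names stripe and transfer_steps are hypothetical.

```python
def stripe(blocks, n_devices):
    """Assign block i to device i % n_devices (round-robin striping)."""
    devices = [[] for _ in range(n_devices)]
    for i, block in enumerate(blocks):
        devices[i % n_devices].append(block)
    return devices


def transfer_steps(devices):
    """Devices are assumed to work concurrently, one block per step,
    so the elapsed steps equal the length of the longest queue."""
    return max(len(queue) for queue in devices)


blocks = list(range(64))                          # a file of 64 blocks
single = transfer_steps(stripe(blocks, 1))        # one device
striped = transfer_steps(stripe(blocks, 8))       # eight devices
speedup = single // striped
```

Under these idealized assumptions the eight-way stripe finishes in one eighth of the steps, matching the eightfold bandwidth figure quoted for eight DataVaults.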

The same hardware and software mechanisms that transfer data between a
partition and an I/O device can also transfer data from one partition to another
(through a named UNIX pipe) or from one I/O device to another.
2.4 A Universal Architecture

The architecture of the CM-5 is optimized for data parallel processing of large, 
complex problems. The Data Network and Control Network support fully
general patterns of point-to-point and multiway communication, yet reward patterns
that exhibit good locality (such as nearest-neighbor communications) with
reduced latency and increased throughput. Specific hardware and software
support improves the speed of many common special cases. Chapter 3 outlines the
nature of this support, which is discussed in even greater detail in later chapters.

Two more key facts should be noted about the CM-5 architecture. First, it 
depends on no specific types of processors. As new technological advances 
arrive, they can be moved with ease into the architecture. Second, it builds a 
seamlessly integrated system from a small number of basic types of modules. 
This creates a system that is thoroughly scalable and allows for great flexibility 
in configuration. 
Chapter 3

Data Parallel Programming 


Connection Machine systems are designed to operate on large amounts of data. 
These data sets may be richly interconnected or totally autonomous. A scientific 
simulation data set, such as a finite-element grid, is highly interconnected, with 
every node value connected to several element values and vice versa. Disparate 
values are continually being brought together, computed on, and redispersed. A 
document database, on the other hand, may be totally autonomous. The search 
of any one document proceeds entirely without reference to any of the others. 
There is no need to repeatedly combine information from multiple documents in 
a single computation. 

The Connection Machine system is made up of large numbers of processors, 
each with its own local memory. From the programming perspective, it is
possible to think of the memory in either of two ways. When computing on
interconnected data sets, it is easiest to think of the memory as a single
multigigabyte data space. When computing on autonomous data, it is easiest to think
of it as many local memories.

Efficient Connection Machine algorithms invariably combine both points of 
view. When gathering data, one regards it as global. When computing on the 
gathered data, one thinks of it as local data, and of the computations themselves 
as being carried out in multiple local memories. 


3.1 Data Sets and Distributed Memory 

Data parallel programs can be expressed in terms of the same data structures used 
in serial programs. Emphasis is on the use of large, uniform data structures, such
as arrays, whose elements can be processed all at once. A statement such as
A = B + C, which in a serial language adds a single number B to a single
number C and stores the result in A, can equally well indicate thousands of
simultaneous addition operations if A, B, and C are declared to be arrays.

In fact, the basic unit of data in a Connection Machine system is the array, or
some other form of parallel variable. Arrays are spread across the distributed
memory of the CM so that each element is in the memory of a separate processor.
If the number of elements in the array matches the number of physical
processors, then each local memory receives one element. If the number of elements
in the array exceeds the number of physical processors, then several elements are
placed in the memory of each processor. The elements remain distinct. Each is
considered to have its own “virtual processor” and is handled accordingly.

The choice of parallel data structures is perhaps the most important aspect of data 
parallel programming. Once data has been properly allocated, executable code 
follows naturally. It is not necessary to use different operation names for
different cases. Parallel code can look just like serial code, in the same way that
floating-point arithmetic looks like integer arithmetic. A conventional compiler
examines the declarations of variables B and C to determine whether the
expression B + C requires an integer or floating-point add instruction. In the same
way, a compiler for a data parallel language determines whether B + C requires
a single addition operation or thousands.
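The idea that one expression can mean a single addition or thousands, depending on how its operands are declared, can be sketched as follows. One important difference from the text: a data parallel compiler makes this decision at compile time from the declarations, whereas this Python sketch inspects the operands at run time. The function name plus is hypothetical.

```python
def plus(b, c):
    """Return b + c for two scalars, or the element-by-element sum of
    two equally shaped (possibly nested) lists."""
    b_is_array = isinstance(b, list)
    c_is_array = isinstance(c, list)
    if b_is_array != c_is_array:
        raise TypeError("this sketch handles two scalars or two arrays only")
    if not b_is_array:
        return b + c                      # a single addition
    if len(b) != len(c):
        raise ValueError("arrays must have the same shape")
    return [plus(x, y) for x, y in zip(b, c)]   # one addition per element


scalar_result = plus(2, 3)
vector_result = plus([1, 2, 3], [10, 20, 30])
matrix_result = plus([[1, 2], [3, 4]], [[10, 20], [30, 40]])
```

The calling code is identical in all three cases; only the shape of the data changes how much work is done.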


Array Layout 

A user program runs within a partition of a CM-5. Defined by the administrator,
a partition may represent part or all of the CM-5 system. In order to allow a
program compiled with a CM compiler to run on a partition of any size, the precise
mapping of data elements to processors occurs at run time; the run-time system
lays out the array for best efficiency. Compiler directives in each language allow
programmers to request that the mapping be optimized for particular purposes.


Local Computation 

Unless the programmer has specified otherwise, arrays of equal size and shape
will have identical layouts. Thus, corresponding elements of each such array will share
the memory of a particular processor. When a computational statement such as
C = A + B is executed, each processor locates and stores the needed data in its
own memory; no interprocessor data movement is required, and the operation
proceeds very quickly.
3.2 Interconnected Data Structures

The inherent structure of most data sets links each data element to some, but not 
all, other elements. Often the linkages are to neighboring elements, in which case 
the structure is said to be localized. 

A matrix, for example, is generally thought of as having row and column
structure. Elements that share one subscript are used in a connected way. If the matrix
is used as part of a finite-difference calculation, then the horizontal and vertical 
neighbors are continually brought together for computation. If a data structure 
is converted from the spatial domain to the frequency domain, then a butterfly 
pattern may be required during the course of a Fast Fourier Transform (FFT). 

It is not possible to arrange interconnected data so that all the pieces of data will
reside in the processors that need to use them, because the same piece of data
may be used in more than one part of the computation, by more than one
processor. Interprocessor communication is required. Computations on data structures
have a definite rhythm: first data elements are brought together, then
computations are performed. Once the data elements have been brought together, the
computations are local. Even on very complex data structures, it is possible to
have most of the interacting elements located in the same processor memory.
Typically, only a few need to be brought in from another processor’s memory.


Establishing Linkages among Data Elements 

Data parallel languages use pointers or array subscripts to establish connections
between processors and hence between their data elements. If the required
patterns are regular and local, such as processors sharing data with their nearest
neighbors, then each processor can easily calculate the address of its neighbors
as needed. For irregular arrays, an array of pointers, itself a parallel data
structure, establishes an arbitrary pattern of intercommunication.
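Both kinds of linkage can be sketched in serial Python. In the regular case each element computes its neighbors' positions arithmetically; in the irregular case a second array holds the positions to read from. The function names are hypothetical, and the wrap-around at the ends of the array is an assumption made to keep the example short.

```python
def neighbors_regular(values):
    """Regular, local pattern: element i finds its neighbors at
    positions i - 1 and i + 1 (wrapping around at the ends)."""
    n = len(values)
    return [(values[(i - 1) % n], values[(i + 1) % n]) for i in range(n)]


def gather(values, pointers):
    """Irregular pattern: pointers[i] names the element that
    position i reads from."""
    return [values[p] for p in pointers]


values = [10, 20, 30, 40]
pairs = neighbors_regular(values)
linked = gather(values, [2, 0, 3, 1])
```

The pointer array is the same length as the data it indexes, which is what makes it a parallel data structure in its own right.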


3.3 Interprocessor Communications 

There are four important categories of interprocessor communications:

■ replication 

■ reduction 
■ permutation

■ parallel prefix 

Each of these four types of data transfer can be applied to regular or irregular 
data sets: to vectors, matrices, multidimensional arrays, variable-length vectors, 
linked lists, and completely irregular patterns. All these combinations are
supported by data parallel software within the CM-5. In addition, the most common
or otherwise important cases are supported directly by special hardware built into
the Control Network. In all cases, the CM-5’s high performance is a result of
having all the processors act cooperatively to achieve the needed data transfers.


Replication 

Replication consists of taking some data values and making a larger number of 
data values by copying them. (See Figure 7.) A single value, for example, may 
be broadcast to all processors for use in a computation. A vector may be copied 
into each column of a matrix, or into each row. (The general case of making 
many copies of an array to fill a higher-dimensional array is called spreading.) 
A less regular pattern is the division of a collection into arbitrary subsets of
varying size, and one may wish to broadcast a different value within each subset. If
the subsets are ordered and not interleaved, one may regard them as a collection
of vectors of various sizes; this common case can be implemented more
efficiently than the general case.

Most data parallel programming languages support broadcasting implicitly; if A 
and B are arrays and x is a scalar quantity, the statement A = B + x implicitly 
broadcasts x to all processors so that the value of x can be added to every element 
of B. The general case of replication is typically supported through parallel array 
indexing, that is, indexing the same array with many index values. If some of the 
index values are the same, then the same array element will be copied to many 
places. Intrinsic functions (such as SPREAD in Fortran) cover important special 
cases. 
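The three forms of replication just described (scalar broadcast, spreading a vector into a matrix, and indexing one array with many index values) can be sketched in serial Python. The function names are hypothetical; spread_columns is written in the spirit of a spread operation but is not claimed to match the argument conventions of the Fortran intrinsic.

```python
def broadcast_add(b, x):
    """A = B + x: the scalar x is used with every element of b."""
    return [element + x for element in b]


def spread_columns(vector, ncopies):
    """Copy vector into each of ncopies columns, so that
    result[i][j] == vector[i] for every column j."""
    return [[v] * ncopies for v in vector]


def index_many(source, indices):
    """Parallel array indexing: repeated index values copy the same
    source element to several places."""
    return [source[i] for i in indices]


added = broadcast_add([1, 2, 3], 10)
matrix = spread_columns([1, 2, 3], 2)
copies = index_many(["a", "b", "c"], [0, 0, 2, 2, 2])
```

In each case the output holds more values than the distinct inputs that produced it, which is the defining property of replication.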
Reduction

Reduction is the opposite of replication: Reduction consists of taking some data 
values and making a smaller number of data values by combining them. (See 
Figure 8. Note that it is similar to Figure 7 except that the arrows all point the 
other way.) A single value, for example, may be produced by computing the sum 
of a set of values; here the combining operation is addition. Other important 
reduction operations include taking largest or smallest value (maximum or
minimum), logical AND (are all results true?), and logical OR (is any result true?). All
these start with a large collection of values and reduce them to a single result.

More complex patterns of reduction mirror related patterns of replication. The 
rows of a matrix may be summed to produce the elements of a column-vector 
result; this is the opposite of a spread operation. A collection of variable-length 
vectors may be reduced, producing a separate sum for each vector. Completely 
general patterns may be specified by index values or pointers. 
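
As a rough serial sketch of these reduction patterns, the Python below combines values with addition, maximum, AND, and OR, and then shows the row-sum, variable-length-vector, and index-driven cases. The function names and sample data are assumptions made for this example.

```python
from functools import reduce
import operator

values = [3, 6, 1, 5]

total = reduce(operator.add, values)        # combining operation: addition
largest = reduce(max, values)               # maximum
all_positive = all(v > 0 for v in values)   # logical AND: are all results true?
any_large = any(v > 5 for v in values)      # logical OR: is any result true?

def sum_rows(matrix):
    """Sum each row, giving one element of a column vector per row."""
    return [sum(row) for row in matrix]

def segmented_sum(data, lengths):
    """Reduce a collection of variable-length vectors stored back to back."""
    out, start = [], 0
    for n in lengths:
        out.append(sum(data[start:start + n]))
        start += n
    return out

def combine_by_index(data, indices, nresults):
    """General pattern: each input is added into the result its index names."""
    out = [0] * nresults
    for v, i in zip(data, indices):
        out[i] += v
    return out
```

For instance, segmented_sum([3, 6, 1, 5, 0, 2], [3, 1, 2]) treats the data as three vectors of lengths 3, 1, and 2 and produces one sum per vector.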


November 1993 

Copyright © 1993 Thinking Machines Corporation 







Connection Machine CM-5 Technical Summary 


Most data parallel languages provide a collection of operators or intrinsic functions
for expressing various patterns of reduction. For example, the Fortran
statement X = SUM(A) sums all the elements of the array A and places the scalar
result in X. The same computation can be expressed in C* as

    x = (+= a);

and if the old value of x is to be included in the sum one may simply write

    x += a;

(which says that every element of a is to be added into x).



Permutation 

Permutation rearranges its inputs to produce the same number of results; every
data value comes from one place and goes to one place. (See Figure 9.) Transposing
a matrix, reversing a vector, shifting a multidimensional grid, and FFT
butterfly patterns are all examples of permutation.

Data parallel languages usually express permutation through parallel array 
indexing and special-purpose intrinsic functions. A typical example of use might 
be a finite-difference grid used in the discretization of Laplace’s Equation, in 
which the average of four nearest neighbors is iteratively computed: 

      C = 0.25 * ( CSHIFT(A,1,+1)
     &           + CSHIFT(A,1,-1)
     &           + CSHIFT(A,2,+1)
     &           + CSHIFT(A,2,-1) )

Here CSHIFT is a Fortran intrinsic that shifts (or rotates) an array with periodic
boundary conditions. Elements shifted off one edge are circularly shifted into the
opposite edge; thus no elements are lost in this operation. In contrast, EOSHIFT
performs an end-off shift that discards shifted-out elements and introduces a pad
value, usually zero, into vacated positions; this operation is thus technically a
hybrid of permutation (of array elements) and replication (of the pad value).
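
The difference between the circular and end-off shifts, and the four-neighbor average built from circular shifts, can be sketched serially in Python. The helpers below are hypothetical stand-ins written for this example; cshift2d takes its arguments in the order (array, dimension, shift), which is an assumption about how the statement above reads.

```python
def cshift(a, shift):
    """Circular shift: result[i] = a[(i + shift) mod n]; no element is lost."""
    n = len(a)
    return [a[(i + shift) % n] for i in range(n)]

def eoshift(a, shift, pad=0):
    """End-off shift: shifted-out elements are discarded, pad fills the gap."""
    n = len(a)
    return [a[i + shift] if 0 <= i + shift < n else pad for i in range(n)]

def cshift2d(grid, dim, shift):
    """Circular shift of a list of rows along dimension 1 (rows) or 2 (columns)."""
    if dim == 1:
        return cshift(grid, shift)
    return [cshift(row, shift) for row in grid]

def neighbor_average(a):
    """Average of the four nearest neighbors with periodic boundaries."""
    up, down = cshift2d(a, 1, +1), cshift2d(a, 1, -1)
    left, right = cshift2d(a, 2, +1), cshift2d(a, 2, -1)
    return [[0.25 * (up[i][j] + down[i][j] + left[i][j] + right[i][j])
             for j in range(len(a[0]))]
            for i in range(len(a))]
```

Applying cshift to [1, 2, 3, 4] with a shift of 1 gives a rearrangement of the same four values, while eoshift with the same arguments drops the 1 and introduces a pad value at the end.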

Figure 9. Permutation.


Parallel Prefix

A parallel prefix operation is a very specific compound operation; it produces as
many results as inputs, but each result may be a reduction of many inputs, and
each input may contribute to many results. There happens to be a rapid and efficient
parallel method for performing this complex compound operation; the
CM-5 supports it with a combination of hardware and library software. It is of
particular use in parallel computations because it permits rapid parallel execution
of operations that at first glance appear to be inherently sequential.

The simplest example of a parallel prefix operation is computing the running
totals of a list of numbers. The kth result is the sum of the first k inputs. (See
Figure 10.) There is a simple sequential implementation of such a computation:

      RUNNING_TOTAL = 0.0
      DO J = 1,1000
         RUNNING_TOTAL = RUNNING_TOTAL + B(J)
         A(J) = RUNNING_TOTAL
      END DO

This would appear to be an inherently sequential process, scanning the array B
from one end to the other, but by bringing many processors to bear in parallel,
one can perform this computation in 10 steps instead of 1000 steps (10 is approximately
the base-2 logarithm of 1000).
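
One well-known way to reach a logarithmic step count is the doubling scheme sketched below in serial Python; it is offered as an illustration of the idea and is not claimed to be the method used by the CM-5 hardware or libraries. Each pass of the loop stands for one parallel step in which every position adds the value a fixed distance to its left, and the distance doubles from step to step. The function name is hypothetical.

```python
from itertools import accumulate

def parallel_prefix_sum(b):
    """Return (running totals, number of parallel steps) for the list b."""
    a = list(b)
    n = len(a)
    steps = 0
    offset = 1
    while offset < n:
        # One parallel step: every position i >= offset adds the value that
        # sits `offset` places to its left. The comprehension reads only the
        # old array, as simultaneous processors would.
        a = [a[i] + a[i - offset] if i >= offset else a[i] for i in range(n)]
        offset *= 2
        steps += 1
    return a, steps

totals, steps = parallel_prefix_sum(list(range(1, 1001)))
```

For 1000 inputs the offsets are 1, 2, 4, and so on up to 512, so the loop runs 10 times, matching the step count quoted above.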






Chapter 3. Data Parallel Programming 


[Figure 10 illustrates parallel prefix on three kinds of data: a 1-D sum-prefix,
variable-length vectors, and linked lists.]

Figure 10. Parallel prefix.


3.4 Conditionals 

Conditional operations are an essential part of data parallel programming, as of
serial programming. Some of the control constructs (if, case) are identical; others
(where, forall) are specific to parallel usage.

Data parallel programs implement conditionals by limiting the impact of operations
to a certain subset of the data elements of a parallel data structure. A
conditional operation first tests a specified condition in all elements of a parallel
data structure. The specified operation is then performed only on elements for
which the condition is true, while either an alternate operation, or no operation,
is performed on the other elements. As in serial programs, conditionals may be
nested in very general ways.
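
The masking behavior just described can be sketched serially in Python. The helper where_apply is hypothetical and only mimics the effect of a where-style construct: the condition is tested in every element, one operation runs where it holds, and an alternate operation (or nothing) applies elsewhere.

```python
def where_apply(mask, data, then_op, else_op=None):
    """Apply then_op where mask is true; else_op, or no operation, elsewhere."""
    out = []
    for m, x in zip(mask, data):
        if m:
            out.append(then_op(x))
        elif else_op is not None:
            out.append(else_op(x))
        else:
            out.append(x)  # no operation: the element is left unchanged
    return out

a = [4.0, 0.0, -2.0, 8.0]
nonzero = [x != 0.0 for x in a]          # test the condition in all elements
recip = where_apply(nonzero, a, lambda x: 1.0 / x)

# Nesting: an inner condition only narrows the set selected by the outer one.
positive = [m and x > 0.0 for m, x in zip(nonzero, a)]
halved = where_apply(positive, a, lambda x: x / 2, lambda x: -1.0)
```

Because the division runs only where the mask is true, the zero element is never divided and simply keeps its old value.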


3.5 In Summary

The data parallel model of computation makes it easy to program massively parallel
computers. The model is also suitable for use on sequential computers,
including vector processors, and on shared-memory parallel computers. High-level
data parallel languages support the data parallel style. The CM-5 architecture
is specifically designed for efficient execution of data parallel programs on
large data sets.

Data parallel programming provides a practical framework for organizing interprocessor
communication. An analogy may be drawn with the way “structured
programming” has provided a practical framework for organizing control flow
in sequential programs. Each model begins with primitive computations and uses
a fixed set of standard combining forms to impose structure on the program.

Structured programming begins with simple assignment statements and observes
that most patterns of control flow can be expressed in terms of sequencing
(begin-end), conditional branching (if-then-else), and looping (while-do).
If these structures are conventionally used wherever appropriate, then use
of a low-level construct such as a goto is a strong indication, and a useful one,
that something unusual is going on; maintenance programmers should pay special
attention, and language designers should ask whether the situation represents
a class of problems that could be addressed more generally. Conventional syntax
has evolved for certain frequently used compound patterns, such as case statements
and do loops.

Similarly, data parallel programming begins with local computations and
observes that most patterns of interprocessor communication can be expressed
in terms of replication, reduction, permutation, and parallel prefix. If these structures
are conventionally used wherever appropriate, then use of a low-level
construct such as explicit message-passing is a strong indication, and a useful
one, that something unusual is going on; maintenance programmers should pay
special attention, and language designers should ask whether the situation represents
a class of problems that could be addressed more generally. Conventional
syntax has evolved for certain frequently used compound patterns, such as shifting
of regular grids, sorting, and fast transforms such as FFT.

As Figure 11 suggests, the data parallel model simplifies the programmer’s job
by providing for parallel programs the conventional structure and discipline that
structured programming provides for sequential programs. Indeed, the data parallel
model is the only programming methodology yet put forward that provides
a coherent global organization for structuring programs that operate on thousands
of processors.

Figure 11. Structuring programs.


3.6 More Information to Come 

This introduction barely begins to present the features and capabilities of the
CM-5. The remainder of this book presents them in somewhat more detail
(although still at a summary level). Part II discusses the software that supports
applications programming on the CM-5. Part III discusses the various aspects of
the system’s architecture.

For information beyond this, you can turn to technical reports on Connection
Machine programming, and to the CM-2-CM-200 and CM-5 documentation sets.
Especially recommended for new users are the manuals Getting Started in C*
and Getting Started in CM Fortran.


The Connection Machine system provides a well-designed, thoroughly integrated
software environment to facilitate applications programming. The
environment seamlessly blends industry standards with data parallel enhancements
to provide both high performance and ease of use.


4.1 Base System Software 

The use of industry standards begins with the UNIX operating system and its network
file system (NFS). Full X11 support provides windowing capability; the
NQS batch system allows submission of batch jobs locally or across a network.
Networking support includes Ethernet and FDDI for local area networking, and
VME and HIPPI for high-performance networking.

Ease of use, meanwhile, is enhanced by Prism, the windowed, integrated development
environment for program editing, debugging, and performance analysis.
Another CM enhancement, a parallel, high-performance file system, provides
excellent I/O performance and allows use of extremely large files.


4.2 Languages and Libraries 

For programming, users choose among the popular languages C, Fortran, and
Lisp. The CM offers data parallel versions of each language, extending the languages’
own constructs in intuitive ways to support the data parallel model.

In addition, specialized libraries offer support for graphics, communications, and
mathematical and scientific programming. All are available from the high-level
languages; low-level programming is not required to achieve high performance
on the Connection Machine supercomputer.



Figure 12. Layered software of the Connection Machine system. 


4.3 CM Software Summarized

Figure 12 summarizes the layered software of the Connection Machine system.
This software is discussed in the following chapters:

    Operating system, file systems, I/O programming ............. Chapter 5
    Prism (the development environment) ......................... Chapter 6
    NQS batch system, checkpointing, the execution environment .. Chapter 7
    CMAX converter .............................................. Chapter 8
    CM Fortran programming language ............................. Chapter 9
    C* programming language ..................................... Chapter 10
    *Lisp programming language .................................. Chapter 11
    CM Scientific Software Library (linear algebra, Fast Fourier
      Transforms, random number generation, histograms) ......... Chapter 12
    Visualization ............................................... Chapter 13
    CMMD (message-passing communications library) ............... Chapter 14


Chapter 5


The Operating System: CMOST 


The CM-5 operating system, CMOST, is an enhanced version of the UNIX operating
system. The enhancements optimize computation, communication, and
I/O performance within the CM-5 system itself, while the adherence to UNIX
standards allows the CM-5 to interact efficiently with other computers in a heterogeneous,
networked environment.

Because the CMOST operating system is built upon standard UNIX, it can provide 
all the services that any standard network server provides: 

■ timesharing and batch processing 

■ standard UNIX protection, security, and user interfaces 

■ support for all standard UNIX-based communications protocols 

■ exchange of data with other systems in an open, seamless fashion 

■ the ability to access files on other systems via NFS protocols and to supply
data to other systems by acting as an NFS server 

■ the Network Queuing System (NQS) and other standard network-oriented 
programs 

■ for scalar programs, binary compatibility with SunOS

Enhancements provide higher-performance services and expanded functionality 
for users within the CM-5 system: 

■ high-speed file access

■ fast parallel interprocessor communications capabilities

■ other parallel operations for optimal utilization of CM-5 hardware

■ central administration and resource management for all CM-5 computational
and I/O facilities

■ support for extended models of data parallel programming, such as data
parallel pipes

■ support for other parallel programming models

■ checkpointing


5.1 CMOST and the CM-5 Architecture

The computational nodes on a CM-5 are grouped into partitions. A partition can
be as small as 32 processors, or as large as the entire machine. The partitioning
is flexible and is controlled by the system administrator, who can create and alter
partitions as needed to meet site requirements. Each partition operates independently
under a control processor acting as a partition manager (PM). Users log
in to (or rsh on to) the PM and, once logged in, have full access to the PM itself,
to all the computational nodes it controls, and, through the operating system,
to all the I/O resources, partitions, and network connections of the CM-5
system. Figure 13 shows a user’s-eye view of the CM-5.

Each partition manager runs a full version of the CMOST operating system. The 
PM makes all operating system resource allocation decisions and all swapping 
decisions for its partition, as well as most system calls for process execution, 
memory management, and I/O. 

Each processing node runs an operating system microkernel, which supports the 
mechanisms required to implement the policy decisions made in the partition 
manager. All operating system code operates in supervisor mode, allowing it to 
access any network address and memory address in the machine. 

When a user process begins running, its partition manager downloads code to the
processing nodes and broadcasts identical memory maps to each node. The nodes
then execute the provided code, each acting on its own data and executing computations
and branches accordingly.

Figure 13. A user’s view of a CM-5.

Users access a CM-5 system by running rlogin or rsh commands on a specific partition manager.
A user program begins execution on the PM, downloads code to the nodes, then runs on nodes and PM
both, passing data as needed among processors.

If a program needs to exchange data with an I/O device or with another process, the PM arranges the
transfer, via system calls to other control processors. Data then flows directly between the nodes and
the I/O device, nodes of another partition, or external network interface, thus ensuring that a parallel
process gets the full benefit of the CM-5 Data Network bandwidth.

All nodes in a partition operate on the same process at the same time. Interprocessor
communication between nodes within an application is handled
entirely by user code, without any operating system overhead. For external communications
the user process calls on the operating system, which requests and
supervises the transfer on behalf of the user process. Data may be transferred
between two processes running timeshared in the same partition or between two
processes running concurrently in different partitions.

Interprocess communication is based on parallel extensions to UNIX sockets and 
pipes and is managed by the operating system. I/O transfers are handled in the 
same manner as transfers between partitions. 


5.2 CMOST and the Users 

Users typically access the CM-5 through an external network, either in batch
mode, via the NQS qsub command, or interactively, via rlogin or rsh commands.

Each PM and IOCP within the CM-5 is a separate host on the network. Users can
log in to any PM or IOCP for which they have appropriate privileges. Once
logged in, a user has access to the full resources controlled by that control processor
and to both local and networked file systems; the user can then run processes
that use a control processor alone or a full partition of PM plus processing nodes.
Since the set of control processors (PMs and IOCPs) within a CM-5 forms a loosely
coupled network of UNIX computers, a user with appropriate privileges can also
run programs on any processor within the CM-5 using the normal UNIX networking
commands.


The Program Development Environment 

The program development environment available to CM users offers the full
capabilities of UNIX and the X Window System. In addition, it offers enhancements
specific to CM parallel programming: parallel languages, specialized
libraries, and tools for parallel debugging and performance analysis. Prism, the
CM-5’s integrated programming environment, facilitates programmers’ use of
the machine (see Chapter 6).


The Program Execution Environment

The program execution environment on the CM-5 supports both interactive, timeshared
program execution and batch execution using the NQS batch system.

Several facilities, such as automatic checkpointing and Prism, the CM programming
environment, aid program development and robustness during execution.
(See Chapter 7.)


5.3 CMOST and the Administrator

CMOST provides the administrator with tools for efficient and flexible resource 
management. It allows the administrator to partition the CM-5 for spacesharing 
among users, to set up the NQS batch system and the accounting system, and to 
monitor system usage, error logging, and power and environmental concerns. In 
addition, it provides all the standard UNIX capabilities, such as setting process 
priorities for use with process scheduling, setting disk quotas to control disk 
space usage, backing up and restoring user data, and setting up user permissions. 

CM-5 administration is centralized at a system console, using commands that are
modeled on SunOS 4.1 commands. The commands execute through a set of daemon
processes that run (depending on their tasks) on the system console
processor, the diagnostic console processor, or the partition managers.


5.4 I/O and File Systems 

I/O programming on the CM-5 uses standard UNIX mechanisms, including sockets,
pipes, character devices, block devices, and serial files. All I/O operations are
modeled as reads and writes to files, regardless of the type of device used for
storage.

CMOST extends the UNIX I/O environment to support parallel reads and writes
and to support very large files, including files above the size supported in most
current UNIX implementations. The virtual file system interface supports device-independent
file behavior and supports many different file system types,
including the standard UNIX file system, the Network File System (NFS), and
two CM file systems: CMFS, which is supported on all Connection Machine systems
and which allows the CM-5, the CM-2, and the CM-200 to share files, file
systems, and I/O peripherals; and a high-performance file system, SFS, that is
supported only on the CM-5.

The CM-5 arranges communications to allow maximum simultaneous performance
of computation and I/O. Transfers from one partition do not affect the
performance of other partitions. Simultaneous transfers from several partitions
see minimal interactions unless they require access to the same I/O device. Direct
I/O-to-I/O transfers allow direct movement of data between a remote machine
and a CM-5 I/O device, or between primary and secondary I/O devices on a CM-5,
without affecting activities in partitions.


The CM-5 File System 

The Connection Machine file systems manage the CM’s high-speed disk storage 
(Scalable Disk Arrays or DataVaults), and other I/O peripherals. 

Within CM-5 files, data is stored in canonical (serial UNIX) ordering, thus allowing
its use by both serial and parallel systems and processes. When a serial
process does I/O, data remains in canonical order throughout; for parallel I/O,
data moves between the canonical order and the ordering required by the
computational nodes.

This reordering serves two important purposes. First, it allows a program to run
on partitions of any size without affecting its I/O: a file written by a process running
on a partition of one size may be read with equal ease by a process running
on a partition of a different size. Second, it allows the same file to be read by
parallel or serial processes. A serial process may read a file written by a parallel
process, and vice versa.
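
The value of a canonical file order can be seen in a small serial Python model. The sketch assumes, purely for illustration, that each node holds one contiguous block of the data; the actual layout used by the CM-5 nodes is not described here, and the function names are hypothetical.

```python
def block_distribute(data, nnodes):
    """Split canonical-order data into one contiguous block per node.

    The block layout is an assumption made for this sketch.
    """
    size = -(-len(data) // nnodes)  # ceiling division
    return [data[k * size:(k + 1) * size] for k in range(nnodes)]

def to_canonical(per_node):
    """Reassemble per-node blocks into canonical (serial) order."""
    return [x for block in per_node for x in block]

data = list(range(16))

# A process on a 4-node partition writes its data in canonical order.
written = to_canonical(block_distribute(data, 4))

# A process on an 8-node partition reads the same file.
read_by_8 = block_distribute(written, 8)

# A serial process sees the file exactly as written.
serial_view = written
```

Because the file itself never records how many nodes wrote it, the 8-node reader and the serial reader both recover the same sequence of values.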

For further information on the CM-5 file systems and I/O, see Chapter 20.


Network Communications 

Data can travel through sockets directly between CM-5 processes and other
machines on the network. A user process can create a socket, send parallel data
to it, and have that data received as a serial stream by a serial or vector computer.
The same socket can carry serial data from control processors; as with file I/O,
network communication uses standard protocols and data ordering for transmission,
and uses parallel ordering only within the parallel computational nodes.


The User’s View

From the user’s point of view, data from any file system, on any device, appears 
the same and is handled in the same manner. A CM-5 control processor, accessing 
data over the Data Network, sees no difference between data stored on any CM-5 
I/O device and data stored on any other UNIX file system. 

Similarly, user processes are not concerned with the storage media on the CM.
Whether data is stored on a single device or striped across multiple devices, the
process accesses it as a single file. The only user-visible difference is in performance.


The Prism programming environment is an integrated Motif-based graphical
environment within which users can develop, execute, debug, and analyze the
performance of programs written for the Connection Machine system. It provides
an easy-to-use, flexible, and comprehensive set of tools for performing all
aspects of Connection Machine programming.

Separate versions of Prism are available for working with data parallel and message-passing
programs. Most of the functionality is the same; some features are
implemented differently, however, taking into account the requirements of the
different programming styles.

Users can either load an executable program into Prism, or start from scratch by 
calling up an editor and a UNIX shell within Prism and using them to write and 
compile the program. 

Once an executable program is loaded into Prism, users can (among other 
things): 

■ Execute the program. Users can simply start the program running or
single-step through it. Execution can be interrupted at any time.

■ Debug the program. Users can perform standard dbx-like debugging
operations such as setting breakpoints and traces, printing the value of a
variable or expression, and displaying and moving through the call stack.

■ Analyze the program’s performance. Data on execution time, broken
down by procedures or by lines of source code, may be displayed as histograms.
See Section 6.2.

■ Visualize data. The values of interactively specified variables or expressions
may be displayed in a variety of textual and graphical formats. See
Section 6.3.

In debugging message-passing programs, users can work with PN sets. PN sets
are predefined or user-created groups of nodes that can be viewed and operated
on as a single entity. For example, the predefined PN set error contains all
nodes in the error state; Prism updates the contents of this set as the program
executes. A user could also define a PN set whose nodes fulfill a specified condition
(for example, all nodes in which the value of x is greater than 0).
Commands can then be applied to a specific PN set. For example, the user could
have the nodes in the set execute the next line of code, or could display the value
of a variable in the nodes.

Prism operates on terminals or workstations running the X Window System. A
commands-only version is also available for users without access to X. Another
option lets X users operate with the familiar commands interface, but send certain
output, such as performance data, to X windows.


6.1 Using Prism 

Figure 14 shows the main window of Prism, with a data parallel program loaded. 
It is within this window that users debug and analyze their programs. Users can 
operate with a mouse, use keyboard equivalents of mouse actions, or issue text 
commands. 

Clicking on items in the menu bar along the top of the window displays pulldown 
menus that provide access to most of Prism’s functionality. 

Frequently used menu items can be moved to the tear-off region, below the menu
bar, to make them more accessible.

The status region displays messages about the program’s status. 

The source window displays the source code for the executable program. The 
user can scroll through this source code or display a different source file. When 
a program stops execution, the source window is automatically updated to show 
the code currently being executed. The user can click on variables or expressions 
in the source code to print their values. The source window can also be split, with 
the assembly code corresponding to the source code appearing in the bottom 
pane. 

The line-number region is associated with the source window. Clicking to the
right of the line number sets a breakpoint at that line.

The command window at the bottom of the main window displays messages and
output from Prism. The user can also type commands in the command window,
rather than use the graphical interface.

[Figure 14 shows Prism’s main window, with callouts for the tear-off region and
the source window. The source window displays the portion of the program
reproduced below, and the command window shows the message
(1) stop at "primes1.fcm":34.]

Source File: primes1.fcm

      program findprimes
      implicit none
      integer i, n, nextprime
      parameter (n = 70000)
      logical primes(n), candid(n)
      integer identity(n)

C     Initialization

      identity = [1:n]
      primes = .false.
      candid = .true.
      candid(1) = .false.

      call loop(n, identity, primes, candid)

      call results(n, primes)

      end

      subroutine loop(n, identity, primes, candid)
      logical primes(n), candid(n)
      integer identity(n)
      integer i, n, nextprime

C     Loop: Find next valid candidate, mark it as a prime,
C     invalidate all multiples as candidates, repeat.

      nextprime = 2

      do while (nextprime .le. sqrt(real(n)))
         primes(nextprime) = .true.
         candid(nextprime:n:nextprime) = .false.
         nextprime = minval(identity, 1, candid)



Figure 14. Prism’s main window. 
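
The loop visible in the source window follows the pattern stated in its comment: find the next valid candidate, mark it as a prime, and invalidate all of its multiples. A serial Python sketch of the same idea is given below. The listing in the figure is cut off before the loop ends, so the final step here (treating every remaining candidate as a prime) is an assumption, as is the function name find_primes.

```python
import math

def find_primes(n):
    """Return the primes up to n using candidate and prime flag arrays."""
    identity = list(range(n + 1))   # identity[i] == i; slot 0 is unused
    primes = [False] * (n + 1)
    candid = [True] * (n + 1)
    candid[0] = candid[1] = False

    nextprime = 2
    while nextprime <= math.sqrt(n):
        primes[nextprime] = True
        # Counterpart of candid(nextprime:n:nextprime) = .false.
        for k in range(nextprime, n + 1, nextprime):
            candid[k] = False
        # Counterpart of the masked minval: smallest remaining candidate.
        nextprime = min((i for i, c in zip(identity, candid) if c),
                        default=n + 1)

    # Assumed final step: every candidate that survives is a prime.
    for i in range(2, n + 1):
        if candid[i]:
            primes[i] = True
    return [i for i in range(2, n + 1) if primes[i]]
```

In a data parallel language the inner for loop and the masked minimum are each a single array operation, which is why the program in the figure needs no explicit loop over the elements.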


6.2 Analyzing Program Performance 

In cooperation with the compilers and run-time library routines, Prism provides
the performance data essential for effectively analyzing and tuning programs.
For data parallel programs, the data includes:

■ control processor user and system time

■ processing time

■ time spent transferring data between control processor and nodes

■ time spent in general Data Network communication

■ time spent doing specific patterns of Data Network communications, such
as nearest-neighbor on a grid

■ time spent doing reductions and parallel prefix operations

For message-passing programs, Prism provides performance data separately for
each node. The data includes processing time for both the scalar microprocessor
and the vector units, as well as time spent performing I/O and various array
operations.

The performance data is displayed as histograms and percentages. For each type
of time measurement, the user can also see the data broken down for each procedure
and each source line in the program. The data on procedures is available in
two versions. One gives a flat per-procedure view of the utilization of the resource;
the other shows utilization using the dynamic call graph of the program.


6.3 Visualizing Data 

When operating on large arrays or parallel variables, it is often important to
obtain a visual representation of their data elements. In Prism, the user can create
visualizers to provide this representation. A wide range of formats is available,
including:

■ Text, where the data is shown as numbers or characters 

■ Dither, where values are displayed as a shading from black to white 

■ Colormap, where each data element is mapped to a single color pixel, 
based on a range specified by the user 

■ Threshold, where each data element is mapped to a single pixel, either 
black or white, based on a cutoff value specified by the user 

■ Graph, where values are displayed as a graph, with the index of each data 
element plotted on the horizontal axis and its value on the vertical axis 

■ Surface, which renders the 3-dimensional contours of a 2-dimensional 
slice of data 

■ Vector, which displays complex data as vectors 

A data navigator allows manipulation of the display window relative to the data
being visualized. If a parallel array is multidimensional, the visualizer displays
a slice through the array; the data navigator provides controls for selecting the
array axes to be displayed and the position of the slice. The user can update a
visualizer or save a snapshot of it.

Figure 15 shows a surface visualizer.



Figure 15. A visualizer.

For message-passing programs, Prism provides an extra dimension for the visualizer;
this dimension represents the nodes in a PN set. By moving along this
dimension, the user can display in turn the visualizer for each node in the set.


6.4 Using Prism with CMAX 

Prism can be used with programs that have been converted from Fortran 77 to
CM Fortran via the CMAX Converter. See Chapter 8 for more information on
CMAX. Prism provides a split-screen option that lets the user view both the CM
Fortran source code and the corresponding Fortran 77 source code simultaneously.
See Figure 16.

[Figure 16 shows a Prism split screen with two source panes: the CM Fortran
source file t200.fcm, produced by the converter from the Fortran 77 program
TEST, and the corresponding Fortran 77 source code.]

Figure 16. CM Fortran and Fortran 77 source code in a split screen.


Users can debug, visualize data, or obtain performance data in terms of either the
CM Fortran or the Fortran 77 source code.


6.5 On-Line Documentation and Help 

The CM-5 provides complete on-line documentation for its software. CMview,
Thinking Machines Corporation’s on-line documentation product, lets users display
any CM manual in a format optimized for on-line viewing. Figure 17 shows
a sample Table of Contents page from a manual.

[Figure 17 shows the contents page of Getting Started in CM Fortran.]

Figure 17. A Table of Contents page displayed on-line via CMview. Users can
click on a hypertext marker to display the desired section.


In addition, CMview lets users: 

■ Follow hypertext links from table of contents and index entries,
cross-references, and specially entered hypertext markers.

■ Search through the entire collection of manuals for a word or phrase.

■ Print all or part of any manual on a laser printer.

■ Put their own bookmarks (like dog-earing a page) and notes (like
scribbling in the margin) on any page of any manual.

CMview is accessible by a menu selection from within Prism. It is also available
by issuing the command cmview from an xterm within the X Window System.

Prism’s own comprehensive help system is based on the same technology. Help
is available for each pulldown menu and dialog box. Users in search of more
information can follow the hypertext link within the help file to display the
section of the on-line Prism manual in which the topic is discussed in detail.


The program execution environment on a CM-5 partition supports both
interactive program execution and batch execution. In either case, the program executes
on the partition manager and accesses the associated set of processing nodes,
plus I/O devices and other devices (such as graphics workstations) as needed.

Two job control systems are available. The Distributed Job Manager (DJM)
provides job control and load balancing for both interactive and batch execution.
Alternatively, the Network Queueing System (NQS) can be used to execute batch
jobs.

Access to the interactive environment, therefore, can be achieved through remote
login or remote shell commands, or through DJM’s jrun command. Access to
the batch environment can be achieved through either DJM’s jsub command or
NQS’s qsub command.

The interactive environment is, by default, a timeshared environment. DJM,
however, can be used to create a dedicated (single-user) interactive environment
at chosen times. In addition, an administrator may limit access to any given
partition by setting UNIX permissions to grant access only to certain users or
projects. The administrator can similarly tailor batch queues to the needs of
particular groups of users or types of jobs, and can define times for dedicated batch
access. The system administrator thus has power, not only to partition the system
optimally for the site’s users, but also to choose the type of environment
available on each partition at any given time: timeshared or dedicated, interactive or
batch or both.

To further enhance the program execution environment, the CM-5 offers the
Prism programming environment (discussed in the previous chapter), with its
suite of tools for debugging and for performance analysis of both data-parallel
and message-passing programs. Additional tools, such as the CM timers and the
checkpointing facility, are also provided, and are discussed at the end of this
chapter.


7.1 Batch Utilities 

The CM-5 offers two batch subsystems: 

■ NQS (Network Queueing System), a batch system for UNIX.

■ DJM (Distributed Job Manager), which provides both job management
and load balancing for Connection Machines.


7.1.1 DJM 

The Distributed Job Manager (DJM) was initially developed at the Minnesota 
Supercomputer Center. It is designed to 

■ manage the flow of jobs through a Connection Machine system 

■ avoid resource conflicts 

■ provide a load-balancing capability among the various control processors
of a CM-5 system

In order to balance system load, DJM handles all application processes running 
on the CM-5, interactive jobs as well as batch jobs. It allows jobs for which there 
are sufficient resources to execute immediately; it queues all other jobs for later 
execution. 

DJM ensures that it handles all jobs by trapping jobs submitted directly, rather 
than via its own job-submission commands. It has the ability to impose limits 
upon such jobs, or even to kill them when appropriate. 

DJM allows users and administrators alike a great deal of flexibility. Users can
request either dedicated time or multi-user access. They can specify many
parameters for their job, or only a few, or they can accept DJM’s default
parameters. They can ask that DJM send them mail when a job begins or ends
execution. They can move a job from interactive execution (“the foreground”)
to batch execution (“the background”), or vice versa.

Administrators can choose how many queues to set up and what restrictions to
put on various queues (which users can use the queue, what resource limits are
placed on jobs using this queue, etc.). They can choose whether to allow
dedicated access to the system, and when to allow it. They can determine how
“foreign” jobs (jobs not submitted via DJM commands) are to be handled.

DJM does not necessarily perform strict FIFO queueing (although it can do so, 
if so configured). Instead, it sets job priorities by calculating a “score” for each 
running job and for each queued job. Assuming available resources, jobs with 
higher scores execute before jobs with lower scores. 

Scores are recalculated at regular intervals. This allows, for instance, “amount of 
time spent waiting in queue” to raise a job’s score, and “job has exceeded its 
expected CPU usage” to lower a job’s score. 
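
The scoring idea can be illustrated with a small sketch in Python. Everything specific here is an assumption made for illustration: the names `Job`, `score`, and `run_order`, the two weights, and the formula itself. The text does not give DJM's actual scoring parameters, which are set by the administrator. The sketch only shows how waiting time can raise a score and a CPU overrun can lower it.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    minutes_waiting: float   # time spent in the queue so far
    cpu_used: float          # CPU minutes consumed so far
    cpu_expected: float      # the user's estimate at submission

# Hypothetical weights; a real site would choose its own values.
WAIT_WEIGHT = 1.0
OVERRUN_PENALTY = 50.0

def score(job: Job) -> float:
    """Higher score means the job is favored."""
    s = WAIT_WEIGHT * job.minutes_waiting
    if job.cpu_used > job.cpu_expected:
        s -= OVERRUN_PENALTY   # the job has exceeded its estimate
    return s

def run_order(jobs):
    """Return the jobs sorted from highest score to lowest."""
    return sorted(jobs, key=score, reverse=True)

jobs = [
    Job("a", minutes_waiting=10, cpu_used=0, cpu_expected=30),
    Job("b", minutes_waiting=40, cpu_used=0, cpu_expected=30),
    Job("c", minutes_waiting=45, cpu_used=35, cpu_expected=30),
]
print([j.name for j in run_order(jobs)])
```

Because scores are recomputed at intervals, calling `run_order` again after the waiting times change can produce a different order, which is what distinguishes this scheme from strict FIFO queueing.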

Users submitting jobs via DJM must specify at least three facts about their jobs: 

■ the number of processors required 

■ the CPU time the job is expected to require

■ the amount of memory the job is expected to require 

DJM treats this information as “soft limits” to help it allocate resources and
schedule jobs. Jobs that overrun their estimates become vulnerable and may be
replaced by jobs with higher priority, if any such jobs are queued and waiting to
run. The administrator sets both hard and soft limits for queues, and also provides
values for the parameters used to construct scores for queues and running jobs.
Thus, the administrator has great control over DJM’s handling of jobs.

The handling of dedicated access is also flexible. The administrator can choose 
when to put a partition into dedicated mode; whether timeshared jobs executing 
at the changeover time can continue to execute or not; whether the changeover 
will happen automatically, or whether it requires the presence of a job in the 
“dedicated queue” to trigger it; and so on. All these elements of flexibility make 
DJM particularly useful for sites where the CM-5 gets heavy and/or varied usage. 


7.1.2 NQS 

The Connection Machine supports the Network Queueing System (NQS) batch
system. This batch system supports two types of queues: batch queues, which are
directed to a specific PM, and which run on the partition that is controlled by that
PM at the time the job is submitted; and pipe queues, which feed jobs (via batch
queues) to any suitable batch queue that is available to run them. The pipe queue
can be directed to any batch queue, or only to batch queues that meet specified
minimal resources. NQS queries current partitions to find one suitable for
running jobs from these queues.

NQS allows the administrator to control the number and characteristics of queues
at a site and to define the hours during which each queue will accept and execute
jobs. Note that the two sets of hours are not necessarily identical: a queue might
accept jobs from 8 am till midnight, but execute jobs between 8 pm and 8 am.
(A queue that accepts jobs is said to be enabled; one that executes jobs is said 
to be started.) 
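
The difference between an enabled queue and a started queue can be sketched in Python using the hours from the example above. The function names and the hour-of-day representation are assumptions made for illustration; they are not part of NQS.

```python
def in_window(hour: int, start: int, end: int) -> bool:
    """True if hour (0-23) lies in [start, end); the window may wrap past midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end

# Hours taken from the example in the text.
ACCEPT = (8, 24)    # accepts jobs from 8 am till midnight
EXECUTE = (20, 8)   # executes jobs between 8 pm and 8 am

def queue_state(hour: int):
    """Return (enabled, started) for the given hour of the day."""
    enabled = in_window(hour, *ACCEPT)
    started = in_window(hour, *EXECUTE)
    return enabled, started

for h in (3, 10, 22):
    print(h, queue_state(h))
```

At 10 am the queue is enabled but not started, so submitted jobs wait; at 3 am it is started but not enabled, so it drains queued jobs without accepting new ones.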


Creating and Configuring Queues 

An NQS manager decides how many queues to create and what characteristics
each queue will have, thus tailoring the batch system to the needs of the
particular site. The administrator uses the qmgr utility to create each queue, naming and
describing the queue and defining

■ the hours during which the queue operates (queues with restricted hours
start and stop automatically at designated times)

■ the priority of this queue in relation to other queues

■ the users or groups of users who can submit jobs to the queue

■ time and size limitations for jobs executing from the queue

■ the CM system resources available to jobs executing from the queue


Submitting Batch Requests 

Frequently, the NQS manager defines a number of queues with different
characteristics. Users can then choose the queue most suitable for each program. In
addition, users can further define the execution environment for a program by
using options to the job submittal command that

■ request that execution be delayed until a particular time

■ request the use of a specified shell

■ request that all environment variables be exported with the job

■ direct the method by which output is to be handled

■ set various per-process limits

■ assign a priority to the job

Users can also ask for notification by electronic mail of a job’s progress, and can 
query the system for information on the characteristics and availability of queues 
and on the status of queued requests. 


Controlling Batch Queues 

NQS operators can start and stop queues, enable and disable queues, and shut 
down NQS. When necessary, they can also remove waiting and executing jobs 
from queues. 


7.2 Timers 

The CM-5 offers two sets of timers: the global CM timers (which time parallel
actions across an entire partition) and the single-node CMMD timers. Both sets
can calculate, with microsecond precision, both the total elapsed time for a
program or routine and the amount of time during which the nodes are active.

Calls to CM timers or CMMD timers can be inserted anywhere in a program. A 
program can use (and nest) up to 64 timers for simultaneous coarse-grain and 
fine-grain timing. 
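
Nested coarse-grain and fine-grain timing can be sketched as follows. This is a Python stand-in, not the CM or CMMD timer interface: the class `TimerSet` and its methods are hypothetical, and only the limit of 64 timers comes from the text. It measures elapsed time only, not node-active time.

```python
import time

MAX_TIMERS = 64  # the limit stated in the text

class TimerSet:
    """Hypothetical stand-in for a set of numbered timers that can be nested."""

    def __init__(self):
        self.elapsed = [0.0] * MAX_TIMERS
        self._started = [None] * MAX_TIMERS

    def start(self, t: int) -> None:
        self._started[t] = time.perf_counter()

    def stop(self, t: int) -> None:
        # Accumulate, so a timer can be started and stopped many times.
        self.elapsed[t] += time.perf_counter() - self._started[t]
        self._started[t] = None

timers = TimerSet()
timers.start(0)                      # coarse-grain: the whole computation
total = 0
for step in range(3):
    timers.start(1)                  # fine-grain: one phase, accumulated over steps
    total += sum(range(10_000))
    timers.stop(1)
timers.stop(0)
print(timers.elapsed[0], timers.elapsed[1])
```

Timer 1 runs entirely inside timer 0, so its accumulated time can never exceed the outer total.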


7.3 Timesharing 

The Connection Machine system uses the UNIX timesharing mechanisms, with
all the administrative flexibility they provide. Each partition manager controls
timesharing on its partition, switching processes in and out as necessary.
(Because a data parallel process running on the PM plus the nodes is a single
process, it is switched as a single entity.)


7.4 Checkpointing

Many applications that run on the Connection Machine system require extended
execution time. Users may need to be able to interrupt and later restart such a
program for any number of reasons: to allow it to run only when the system is
not needed for other use, to allow for scheduled machine downtime, to protect
against unscheduled halts, or simply to allow for restarting the program from
some intermediate state during debugging. The Connection Machine system
supports this need with a checkpointing facility.

Checkpointing a program lets the user save (and later restart) an executable copy 
of a program’s state. This includes the program’s state on the partition manager 
(PM) and nodes, a list of the files that the program has open at the time of the 
checkpoint, and a stored copy of the checkpointed program. 

The CM checkpointing facility offers three basic methods of checkpointing: 

■ inserting checkpoints at particular points in a program

■ having checkpoints occur periodically

■ having a checkpoint occur when a program is sent a particular signal, such
as the signal sent during a planned shutdown of the system
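
The three methods listed above can be imitated in an ordinary Python program. This is only an analogy under stated assumptions: the file name, the functions `save`, `restore`, and `run`, the use of `pickle`, and the choice of `SIGUSR1` are all hypothetical, and none of it is the CM checkpointing interface. It saves a small dictionary rather than a full program image.

```python
import os
import pickle
import signal
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "sketch_checkpoint.pkl")
checkpoint_requested = False

def on_signal(signum, frame):
    # Method 3: the signal only sets a flag; state is saved at a safe point.
    global checkpoint_requested
    checkpoint_requested = True

def save(state):
    with open(CKPT, "wb") as f:
        pickle.dump(state, f)

def restore():
    with open(CKPT, "rb") as f:
        return pickle.load(f)

def run(state, steps, every=100):
    global checkpoint_requested
    while state["i"] < steps:
        state["total"] += state["i"]
        state["i"] += 1
        # Method 2 (periodic) or method 3 (signal requested).
        if state["i"] % every == 0 or checkpoint_requested:
            save(state)
            checkpoint_requested = False
    return state

if hasattr(signal, "SIGUSR1"):
    signal.signal(signal.SIGUSR1, on_signal)

first = run({"i": 0, "total": 0}, steps=250)   # last periodic save is at i == 200
resumed = run(restore(), steps=250)            # restart from the saved state
save(resumed)                                  # method 1: an explicit checkpoint
```

Restarting from the saved state at step 200 reaches the same final result as the uninterrupted run, which is the property a checkpointing facility is meant to provide.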

Checkpointing can be used from within batch jobs and interactive jobs, including
those running under cmdbx and Prism. It can be used on programs that execute
on the PM only, as well as those that use both the PM and the nodes.


CMAX, the “CM Automated X-lator,” is a tool that converts standard
Fortran 77 into CM Fortran. CMAX provides a convenient migration path for
serial programs onto the massively parallel Connection Machine system, both
for data parallel applications and for CM Fortran/CMMD message-passing
applications.



In addition, CMAX gives users the option of maintaining their software in
Fortran 77 for maximum portability to multiple platforms. Users in a heterogeneous
computer environment and third-party software developers can use the converter
as a “preprocessor” for routine Fortran compilation for CM systems. In this
sense, CMAX provides a migration path onto and off of the Connection Machine
system.

The major difference between serial and data parallel Fortran programs is the
substitution of array operations for loop iterations, and the concomitant need to
lay out some arrays across the processing nodes. These are the tasks performed
by the CMAX converter.

CMAX is a DO loop vectorizer; it analyzes loop constructs and translates them
into CM Fortran array operations. For greatest efficacy, the converter performs
an interprocedural dependence analysis of the whole program and applies
vectorization techniques such as loop fissioning, scalar promotion, and loop pushing
to the input code. CMAX also recognizes the intent of numerous programming
idioms, such as structured data interactions and dynamic array allocation. When
translating code, it makes full use of powerful Fortran 90 features such as
array-processing intrinsic functions and dynamic allocation statements, as well as the
FORALL statement defined by High Performance Fortran. CMAX thus provides
entree both to the Connection Machine system and to the emerging HPF standard.

CMAX provides a convenient interface to the user. The Prism development
environment provides facilities for examining CMAX output and comparing it
line-by-line with the input program. CMAX command options and in-line
directives allow the user to control the converter’s actions and decision rules. The
CMAX library provides canonical, portable, and translatable Fortran 77
utilities for expressing common operations like dynamic array allocation and
circular array element shifts. The converter generates detailed notes of a
conversion, explaining all the changes it has made.

Although CMAX is designed primarily to assist in the creation of new
applications, it accepts as input any program that is written in standard Fortran 77 and
follows standard guidelines for scalability. These simple guidelines guarantee
that a program runs efficiently on any size data set, large or small, and on any
number of processors, from one to thousands. The combination of guidelines
plus converter can substantially ease the task of upgrading “dusty deck”
programs to take advantage of modern architectures and language features.



Scalable
Fortran 77

The conventions of scalable Fortran programming express three basic objectives:

■ Make it easy for a compiler to recognize how data and computations may
be split up for independent or coordinated processing. For example: loop
over as many array axes as possible in a single operation; use standard
idioms to express common, well-structured data dependences.

■ Avoid constructions that rely on a particular memory organization, such
as linearizing multidimensional arrays or changing array size or shape
across program boundaries.

■ Use data layout directives and library procedures (with some
conditionalizing convention) to take advantage of the specific performance
characteristics of each target platform. For example, Fortran 77 programs
targeted to the CM system can use compiler directives to fine-tune data
layout and access the CM libraries for procedures that are specially tuned
for performance on the CM system.


Fortran for the Connection Machine system is standard Fortran 77 supplemented
with the array-processing extensions of the ANSI and ISO (draft) standard
Fortran 90. These extensions provide convenient syntax and numerous intrinsic
functions for manipulating arrays.

Newly written Fortran programs can use the array extensions to express efficient
data parallel algorithms for the CM. These programs will also run on any other
system, serial or parallel, that implements Fortran 90. CM Fortran also offers
several extensions beyond Fortran 90, such as the FORALL statement and some
additional intrinsic functions. These features are well known in the Fortran
community and are particularly useful in data parallel programming.


9.1 Structuring Parallel Data 

Fortran 90 allows an array to be treated either as a set of scalars or as a first-class
object. As a set of scalars, array elements must be referenced explicitly in a DO
construct. In contrast, a reference to an array object is an implicit reference to all
its elements (in unspecified order). For example, to increment the elements of the
100-element array A by 1, a program can reference the array either way:

    A as a set of scalars        A as an object

    DO I=1,100
       A(I) = A(I) + 1           A = A + 1
    END DO

To operate on multidimensional arrays, DO loops must be nested to reference
each element explicitly. In the statement A = A + 1, however, A could be a scalar,
a vector, a matrix, or a higher-dimensional array.

CM Fortran takes advantage of this standard feature when allocating arrays on the 
CM system. An array that is used only as a set of scalars is stored and processed 
on the partition manager in the normal serial manner. Any array that is referenced 
as an object is stored in node memory, one element per processor, and processed 
in parallel. In essence, the partition manager executes all of CM Fortran that is 
Fortran 77, and the nodes execute all the array extensions drawn from Fortran 90. 
No new data structure is required to express parallelism. 


[Figure: the partition manager and the processing nodes; the processing nodes
hold array objects and execute Fortran 90 operations.]


The simple array reference A may be written more explicitly using a triplet
subscript, A(1:100:1), which resembles the control specification of a DO loop.
Using triplet subscripts, you can replace one or more DO loops with an array
reference that indicates all the elements of interest, and thereby cause the array
to be processed in parallel.

An implicit triplet (that is, the array name alone) is usually used for whole
arrays. You can, however, explicitly specify any of the index variables, just as in
a DO loop, to indicate a section of the array. For example, some sections of array
B(4,6) are:

    B(1:2,:)

    B(3:4,4:6)

    B(:,2:6:2)

    B(3,:)

Array sections can be used anywhere that whole arrays are used: in expressions
and assignments and as arguments to procedures.
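
Which elements a triplet subscript selects can be checked with a small Python model. The helper names `triplet` and `section` are hypothetical, the array is an ordinary nested list rather than a distributed array, and only positive strides are handled. Each element of the 4 x 6 test array holds 10*i + j so that its position can be read from its value.

```python
def triplet(lo, hi, stride=1):
    """Indices selected by a triplet lo:hi:stride (1-based, inclusive, stride > 0)."""
    return list(range(lo, hi + 1, stride))

def section(b, rows, cols):
    """Extract from the 2-D list b the section named by 1-based index lists."""
    return [[b[i - 1][j - 1] for j in cols] for i in rows]

# A 4 x 6 array whose element (i, j) holds 10*i + j.
B = [[10 * i + j for j in range(1, 7)] for i in range(1, 5)]
ALL_ROWS, ALL_COLS = triplet(1, 4), triplet(1, 6)

s1 = section(B, triplet(1, 2), ALL_COLS)        # rows 1-2, all columns
s2 = section(B, triplet(3, 4), triplet(4, 6))   # rows 3-4, columns 4-6
s3 = section(B, ALL_ROWS, triplet(2, 6, 2))     # all rows, columns 2, 4, 6
s4 = section(B, [3], ALL_COLS)[0]               # row 3 alone, a rank-1 result
print(s2, s4)
```

A scalar subscript such as the 3 in the last case removes a dimension, which is why `s4` is a flat list rather than a one-row matrix.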


9.2 Computing in Parallel

The most straightforward form of data parallel computing is elemental
computing, that is, operating on array elements all at the same time, each independently
of the others. An assignment statement where the entire array is referenced as an
object has this effect. For example, consider the following assignment statement
for an 8 x 8 x 8 array C:

    C = C**2

The CM system allocates one element of C in each of 512 processors, and all the
processors operate on their respective elements of C at the same time.

An expression or assignment can involve any number of arrays or array sections, 
as long as they are all of the same shape. Scalars can be intermixed freely in array 
operations, since Fortran 90 specifies that a scalar is effectively replicated to 
match any array. For example, the following statement assumes that D and E are 
10 x 10 matrices and F is a 10 x 100 x 100 array: 

D = E*2.0 + 1.0 + F(:,1:10,3) 

Another form of array operation uses an elemental intrinsic function. Fortran 90
extends most of the intrinsic functions of Fortran 77 so that they can take either
a scalar or an array as an argument. If G is an array, this statement operates
elementally:

G = SIN(G) 

An array assignment can be performed conditionally if it is constrained by a
WHERE statement. This statement includes a logical mask; it behaves like a DO
loop with an embedded IF statement (except that the order in which elements are
processed is unspecified). For example, to avoid division by zero in an array
assignment, one might say:

    WHERE (D.NE.0) E = E/D
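
The effect of the masked assignment can be modeled in Python on flat lists. The function name `where_divide` is hypothetical; the point of the sketch is that positions where the mask is false keep their old value and the division is never evaluated there.

```python
def where_divide(e, d):
    """Model of the masked assignment: divide e by d only where d is nonzero."""
    mask = [x != 0 for x in d]
    # Elements with a false mask are copied through unchanged.
    return [ei / di if m else ei for ei, di, m in zip(e, d, mask)]

E = [8.0, 3.0, 9.0, 5.0]
D = [2.0, 0.0, 3.0, 0.0]
print(where_divide(E, D))   # [4.0, 3.0, 3.0, 5.0]
```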

Finally, CM Fortran offers a form of elemental array assignment, the FORALL
statement, whose action is position-dependent. The syntax of a FORALL
statement resembles a DO construct, but the assignments can be executed in parallel.
For example, to initialize H as a Hilbert matrix of size N:

    FORALL (I=1:N, J=1:N) H(I,J) = 1.0 / REAL(I+J-1)

FORALL can use a mask to make its action dependent on either the value or the
position of the individual array elements. For example, to clear matrix H below
the diagonal, one can set a mask to select those positions where row index I is
greater than column index J:

    FORALL (I=1:N, J=1:N, I.GT.J) H(I,J) = 0.0

To initialize a table of integer logarithms:

    FORALL (I=1:10) LG(2**(I-1) : 2**I - 1) = I - 1
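
The three position-dependent assignments above can be reproduced in Python to see what values they generate. This is a serial model with hypothetical function names; it stores arrays 0-based, so the comments give the 1-based positions used in the text. Each function builds a fresh result, which matches the rule that every right-hand side is evaluated before any element is assigned.

```python
def hilbert(n):
    """Element (i, j), 1-based, is 1.0 / (i + j - 1)."""
    return [[1.0 / (i + j - 1) for j in range(1, n + 1)] for i in range(1, n + 1)]

def clear_below_diagonal(h):
    """Zero every position whose row index exceeds its column index."""
    n = len(h)
    return [[0.0 if i > j else h[i][j] for j in range(n)] for i in range(n)]

def log_table():
    """lg[k] is the integer base-2 logarithm of k for k in 1..1023; lg[0] is unused."""
    lg = [None] * 1024
    for i in range(1, 11):
        # One section per i: positions 2**(i-1) through 2**i - 1 all get i - 1.
        for k in range(2 ** (i - 1), 2 ** i):
            lg[k] = i - 1
    return lg

H = hilbert(3)
U = clear_below_diagonal(H)
LG = log_table()
print(U[1], LG[1:8])
```

In the logarithm table the ten sections do not overlap, so the ten section assignments are independent of one another and can proceed in any order.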


9.3 Communicating in Parallel

A second form of data parallel computing requires processors to access each
other’s memories, all at the same time. The pattern of interprocessor communication
can be either regular (grid-based) or arbitrary. Fortran 90 defines a number of
features that move data from one array position to another; these features map
naturally onto the communication mechanisms implemented in CM hardware.


Grid-Based Communication 


Many applications, such as convolutions and image rotation, need to move data
in regular grid patterns. One way to specify such motion in Fortran 90 is by
assigning array sections. For example, to shift vector values to the left:

    V(1:9) = V(2:10)

To shift data on more than one dimension:



Fortran 90 also defines intrinsic functions that perform grid-based data motion.
The function CSHIFT performs a circular shift of array elements, and EOSHIFT
performs an end-off shift. For example, the following statement shifts the
elements on the second dimension of A by one position to the left and assigns the
result to B. (The shift argument can also be an array, which shifts the rows by
different offsets.)

    B = CSHIFT( A, DIM=2, SHIFT=1 )

One notable use of CSHIFT is in so-called “stencils,” array expressions that
compute a weighted sum of neighboring points of a specific grid point. A simple
example would be

    A = C3*B + C1*CSHIFT(B, DIM=1, SHIFT=-1) + C2*CSHIFT(B, DIM=2, SHIFT=-1)

The CM Fortran compiler includes optimizations that provide particularly high 
performance for stencils. 
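
The two shift functions and a three-point stencil can be modeled serially in Python. The definitions below follow the Fortran 90 convention that a positive shift moves elements toward lower indices; the function names, the `boundary` default of 0, and the particular stencil (the point itself plus its neighbor at the previous index on each axis) are assumptions chosen for illustration, not the compiler's implementation.

```python
def cshift(v, shift):
    """Circular shift: result[i] = v[(i + shift) mod n]."""
    n = len(v)
    return [v[(i + shift) % n] for i in range(n)]

def eoshift(v, shift, boundary=0):
    """End-off shift: elements pushed past an end are dropped; gaps get boundary."""
    n = len(v)
    return [v[i + shift] if 0 <= i + shift < n else boundary for i in range(n)]

def cshift2d(a, dim, shift):
    """Circular shift of a 2-D list along its first (dim=1) or second (dim=2) axis."""
    if dim == 2:
        return [cshift(row, shift) for row in a]
    return cshift(a, shift)   # rotating the list of rows moves data along axis 1

def stencil(b, c1, c2, c3, shift=-1):
    """Weighted sum of each point and one neighbor along each axis."""
    n1 = cshift2d(b, 1, shift)
    n2 = cshift2d(b, 2, shift)
    return [[c3 * b[i][j] + c1 * n1[i][j] + c2 * n2[i][j]
             for j in range(len(b[0]))] for i in range(len(b))]

print(cshift([1, 2, 3, 4], 1), eoshift([1, 2, 3, 4], 1))
print(stencil([[1, 2], [3, 4]], 1, 1, 1))
```

Writing the stencil as whole-array shifts, rather than as index arithmetic inside a loop, is what exposes the regular grid communication pattern to the compiler.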


General Communication 

Processors must communicate in arbitrary patterns to map an unstructured
problem onto a grid or to index into arbitrary locations of an array. To perform these
operations in parallel, CM Fortran provides vector-valued subscripts and FORALL.

A vector-valued subscript is a form of array section that uses a vector of index
values as a subscript. If A is a vector of length 10 and P is an array containing
a permutation of the integers from 1 to 10, then A = A(P) applies this
permutation to the values in A. The statement A(P) = A applies the inverse permutation.

The index values can be repeated, which causes element values to be repeated
in the section. For example, if V is the vector (/2,6,4,9,9/), then A(V) is a
five-element vector whose values are A(2), A(6), A(4), A(9), and A(9), in
that order.
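
A vector-valued subscript on the right-hand side acts as a gather, and on the left-hand side as a scatter. The Python sketch below models both on plain lists with 1-based index values; the function names and the sample data are hypothetical.

```python
def gather(a, idx):
    """Model of A(idx) in an expression: pick elements by 1-based index; repeats allowed."""
    return [a[i - 1] for i in idx]

def scatter(a, idx):
    """Model of the assignment R(idx) = A: element k of a lands at position idx[k]."""
    out = [None] * len(a)
    for k, i in enumerate(idx):
        out[i - 1] = a[k]
    return out

A = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
V = [2, 6, 4, 9, 9]
P = [3, 1, 2, 5, 4, 7, 6, 10, 8, 9]   # an arbitrary permutation of 1..10

print(gather(A, V))                    # [20, 60, 40, 90, 90]
print(scatter(gather(A, P), P) == A)   # True: scatter undoes gather
```

The round trip in the last line is the sense in which assigning through P on the left-hand side applies the inverse of the permutation applied by P on the right-hand side.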


The FORALL statement provides the same arbitrary indexing into an array of any
rank. For example, the following statement uses the two-dimensional index
arrays X and Y to permute the values of a two-dimensional array B:

    FORALL (I=1:N, J=1:M) C(I,J) = B( X(I,J), Y(I,J) )


9.4 Transforming Parallel Data 

Fortran 90 defines a rich set of intrinsic functions that take an array argument and
construct a new array (or scalar). All these transformational functions take only
array objects (not arrays subscripted in the Fortran 77 manner), and all are
therefore computed in parallel on the CM.

One set of transformational functions is the reduction intrinsics, such as SUM or
MAXVAL. These functions apply a combining operator to the elements of an array
(or array section) and return the result as a scalar. For example, given a 100 x 500 
matrix D, the following expression returns the sum of the elements in the upper 
left quadrant: 

SUM( D(1:50,1:250) ) 

These functions can take a mask argument to make the reduction conditional. If
applied only to a specified dimension, they return an array of rank one less than
the argument array. For example, given the 100 x 500 matrix D, the following
expression returns a 100-element vector containing the sums of the positive
elements in each row:

    SUM( D, DIM=2, MASK=D.GT.0 )
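
A reduction along one dimension with a mask can be modeled in Python as follows. The function names are hypothetical, and the mask is passed as a predicate on element values rather than as a logical array, which is a simplification of the Fortran 90 form.

```python
def sum_all(d):
    """Full reduction: a single scalar for the whole 2-D list."""
    return sum(sum(row) for row in d)

def sum_dim2(d, mask=None):
    """Reduce along the second axis: one sum per row, counting only masked elements."""
    if mask is None:
        mask = lambda x: True
    return [sum(x for x in row if mask(x)) for row in d]

D = [[1, -2, 3],
     [-4, 5, -6]]

print(sum_all(D))                       # -3
print(sum_dim2(D))                      # [2, -5]
print(sum_dim2(D, lambda x: x > 0))     # [4, 5]
```

The result of the dimensional reduction has one element per row, so a 2 x 3 input yields a 2-element vector, one rank lower than the argument.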

A parallel prefix, or scan, operation applies a combining operator cumulatively
along a grid dimension, giving each element the combination of itself and all
previous elements. These operations, which are useful in such algorithms as
line-of-sight and convex-hull, can be expressed with the FORALL statement and
a reduction function. For example, in the following add-scan (or sum-prefix)
operation, each element of B gets the sum of all elements up to and including the
corresponding element of A:

    FORALL (I=1:N) B(I) = SUM( A(1:I) )
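
The add-scan can be written in Python both as a direct transcription of its definition and as a single running total; the two must agree. The function names are hypothetical, and this serial model says nothing about how a parallel machine would actually schedule the operation.

```python
from itertools import accumulate

def add_scan_definition(a):
    """Each result element is the sum of all input elements up to and including it."""
    return [sum(a[: i + 1]) for i in range(len(a))]

def add_scan_running(a):
    """The same result computed with one running total."""
    return list(accumulate(a))

A = [3, 1, 4, 1, 5]
print(add_scan_definition(A))   # [3, 4, 8, 9, 14]
print(add_scan_running(A))      # [3, 4, 8, 9, 14]
```

The definitional form states what each element should hold independently of the others, which is what lets the operation be expressed as a position-dependent parallel assignment.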

The array construction functions transform arrays in a wide variety of ways. For
example, TRANSPOSE performs matrix transposition; RESHAPE constructs a new
array with the same elements as the argument but a different shape; PACK and
UNPACK behave as gather/scatter operations; and SPREAD replicates an array
along a new dimension. CM Fortran also provides the Fortran 90 array
multiplication functions, DOTPRODUCT and MATMUL. In addition to the standard
Fortran 90 intrinsics, CM Fortran also offers the functions DIAGONAL,
REPLICATE, RANK, PROJECT, FIRSTLOC, and LASTLOC.


C* is an extension of the C programming language designed to support data
parallel programming.

The C* language is based on the standard version of C specified by the American
National Standards Institute (ANSI). C programmers will find most aspects of C*
code familiar to them. C language constructs such as data types, operators,
structures, pointers, and functions are all maintained in C*; new features of Standard
C such as function prototyping are also supported. C* extends C with a small set
of new features that allow programmers to use the Connection Machine system
efficiently.

C* is well suited for applications that require dynamic behavior, since it allows
the size and shape of parallel data to be determined at run time. In addition, it
provides programmers with all the standard benefits of C, such as block
structure, access to low-level facilities, string manipulation, and recursion. C* also
provides a straightforward method for calling CM Fortran subroutines from a C*
program.


10.1 Structuring Parallel Data 

In C*, data is allocated on the processing nodes only when it is tagged with a 
shape. A shape is a way of logically configuring parallel data. C* includes a new 
construct called left indexing that is used in declaring a shape. The left index 
specifies the number of dimensions (or axes) in the shape and the number of 
positions along each dimension. Positions correspond to processors (or virtual 
processors). For example, 

    shape [25][51]s;

declares a shape s that is laid out as a 25 x 51 grid on the processing nodes.

This shape is considered to be fully specified, since the number of dimensions
and positions are provided at compile time. Shapes may also be partially
specified or fully unspecified. C* lets the programmer dynamically allocate and
specify shapes, thus providing flexibility in the way they can be used.

Once a shape has been fully specified, one can declare parallel variables of that
shape. Parallel variables have both a Standard C data type and a shape. For
example, the code

    shape [16384]t;

    int:t parallel_int1, parallel_int2;
    float:t parallel_float1;

declares three parallel variables of shape t; each consists of 16384 elements, laid
out along one dimension. Parallel variables interact most efficiently when they
are of the same shape. In addition to the above method, parallel variables can also
be allocated dynamically.

C* also provides parallel versions of arrays and structures. For example, the code 

shape [16384]t; 
int:t parray[16]; 

declares a parallel array, parray, which consists of 16 parallel ints of shape t. 
The code 

    shape [16384]t;
    struct scalar_struct {
        int a;
        float b;
    };

    struct scalar_struct:t pstruct;

declares a parallel structure, pstruct, that consists of the Standard C structure
scalar_struct replicated in each of the 16384 positions of shape t.

C* includes pointers to both shapes and parallel variables. As in Standard C, C*
pointers are fast and powerful.


10.2 Computing in Parallel 

Parallel Use of Standard C Operators 

C* extends the use of Standard C operators, through overloading, to apply to
parallel data as well as scalar data. For example, if p1, p2, and p3 are all parallel
variables of the same shape, the statement

    p3 = p2 + p1;

performs a separate addition of the values of p1 and p2 in each position of the
shape and assigns the result to the element of p3 in that position. The additions
take place in parallel. If p1 or p2 were not a parallel variable, it would first be
promoted to parallel, with its value replicated in every element. Note that this
line of code looks exactly like Standard C; the result differs, however, depending
on whether the variables are parallel or scalar.


The with and where Statements 

C* adds new statements to Standard C that allow operations on parallel data.

The with statement selects a current shape. In general, parallel variables must
be of the current shape before parallel operations can take place on them. For
example, code like the following is actually required to perform a parallel
addition like the one shown above:

    shape [16384]t;
    int:t p1, p2, p3;

    with (t)
        p3 = p2 + p1;

C* also adds a where statement to restrict the set of positions on which
operations are to take place; the positions to be operated on are called active. Selecting
the active positions of a shape is known as setting the context. The where
statement in the following example ensures that division by 0 is not attempted:

    with (t)
        where (p1 != 0)
            p3 = p2 / p1;

Serial code always executes, no matter what the context.

Programs may contain nested where statements; these cumulatively shrink the
set of active positions. The context is passed into functions called within the
scope of a where statement and is correctly reestablished when returning to an
outer level as a result of a break, continue, goto, or return statement. Note
that the context does not affect the flow of control of a program. One can still
use Standard C statements such as if and while to manipulate flow of control.

C* extends the Standard C else statement for use in conjunction with the where
statement; using else after a where reverses the set of active positions. The new
everywhere statement makes all positions active. 
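
The effect of a context can be imitated serially. The following is a minimal Python sketch, not C*: the helper name where_divide is hypothetical, lists stand in for parallel variables, and None marks positions that were never assigned. It only illustrates that a where test chooses which positions take part, and that an else branch would see the complementary set.

```python
# Serial emulation of a C*-style context (hypothetical helper, not C* itself).
def where_divide(p2, p1):
    """Divide p2 by p1 elementwise, but only in positions where p1 != 0."""
    active = [x != 0 for x in p1]      # the context: one flag per position
    p3 = [None] * len(p1)              # inactive positions are left untouched
    for i, is_active in enumerate(active):
        if is_active:
            p3[i] = p2[i] / p1[i]      # float division, unlike C integer division
    return p3, active

p1 = [2, 0, 4, 0]
p2 = [8, 5, 2, 7]
p3, active = where_divide(p2, p1)

# An "else" after the "where" would operate on the complement of the active set.
inactive = [not a for a in active]
```

An everywhere block would correspond to setting every flag in active to True for the duration of the block.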


New Operators 

C* adds a few new operators to Standard C. For example, the <? and >? operators
are available to obtain the minimum and maximum of two variables (either
scalar or parallel). The corresponding compound assignment operators <?= and
>?= are also included. The operator %% provides a true modulus operation (as
compared to the remainder operator %).
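
The difference between a remainder and a true modulus shows up only for negative operands. The Python sketch below is illustrative: the function names are hypothetical, and it assumes that %% yields a result with the sign of the divisor, which is how Python's own % behaves for integers.

```python
def c_remainder(a, b):
    """Remainder as Standard C's % computes it: the quotient truncates toward zero."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return a - b * q

def true_modulus(a, b):
    """Modulus whose result takes the sign of the divisor (assumed %% behavior)."""
    return a % b

def parallel_min(xs, ys):
    """Elementwise minimum, in the spirit of the <? operator on parallel data."""
    return [min(x, y) for x, y in zip(xs, ys)]

def parallel_max(xs, ys):
    """Elementwise maximum, in the spirit of the >? operator on parallel data."""
    return [max(x, y) for x, y in zip(xs, ys)]
```

For example, c_remainder(-1, 8) gives -1 while true_modulus(-1, 8) gives 7, which is why the modulus form is the one that suits wraparound index arithmetic.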


Parallel Functions 

Functions in C* can pass and return parallel variables and shapes. If it is not 
known what the current shape will be when the function is called, you can use 
the new keyword current in place of a specific shape name within the function 
declaration; current always means the current shape. 

A useful feature of C* is overloading of functions. C* allows you to declare more 
than one version of a function with the same name — for example, one version 
for scalar data and another for parallel data. The compiler automatically chooses 
the right version.


10.3 Communicating in Parallel

C* provides two methods of parallel communication: as part of the syntax of the
language and via an extensive library of functions. Both allow communication
in regular patterns within shapes and in irregular patterns both within and
between shapes.


Regular Communication 

C* uses the intrinsic function pcoord to provide a self-index for a parallel
variable along a specified axis of its shape. For example, if p1 is of a
one-dimensional shape with 16384 positions (and the shape is current), pcoord
initializes p1 as shown in Figure 18.


p1 = pcoord(0);

Figure 18. The use of pcoord with a one-dimensional shape.


The pcoord function is typically used to provide regular communication — 
called grid communication in C* — along the axes of a shape. For example, the 
following code sends values of source to the elements of dest that are one 
coordinate higher along axis 0: 

[pcoord(0) + 1]dest = source;

In the common case where pcoord is called within a left index expression, and
the argument to pcoord specifies the axis indexed by the left index, C* allows
a shortcut: the call to pcoord can be replaced by a period. Thus, for a
two-dimensional shape, the following provides grid communication along both
axis 0 and axis 1:

[. + 1][. - 2]dest = source;    (A chess knight's move)

Wrapping from one end of an axis to the other is provided by a standard C*
programming idiom that involves the use of pcoord along with the new modulus
operator %% and the dimof intrinsic function, which returns the number of
positions along an axis of a shape.

Library functions are also available to perform grid communication. For
example, the to_grid_dim and to_grid functions can be used in place of the
statements above.
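
The grid send and the wrapping idiom can be sketched serially. The Python below is an illustration under stated assumptions, not C*: grid_send is a hypothetical name, the loop variable plays the role of pcoord(0), len(source) plays the role of dimof, and positions whose target falls off the end of the axis are simply skipped when wrapping is off (a simplification).

```python
def grid_send(source, offset, wrap=False):
    """Send each element of source to the position `offset` higher along axis 0."""
    n = len(source)               # stands in for dimof(shape, 0)
    dest = [None] * n             # None marks positions that receive nothing
    for coord in range(n):        # coord stands in for pcoord(0)
        target = coord + offset
        if wrap:
            target %= n           # Python's % acts as a true modulus here
        if 0 <= target < n:
            dest[target] = source[coord]
    return dest

shifted = grid_send([10, 20, 30, 40], 1)
wrapped = grid_send([10, 20, 30, 40], 1, wrap=True)
```

Without wrapping, position 0 of the destination receives nothing; with wrapping, it receives the value from the last position.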


Irregular Communication 

C* uses the concept of left indexing to provide communication between different 
shapes, as well as within-shape communication that does not necessarily occur 
in regular patterns. 

A left index can be applied to a parallel variable. If the index itself is a parallel
variable, the result is a rearrangement of the values of the parallel variable being
indexed, based on the values in the index. If the index is of one shape and the
parallel variable being indexed is of another shape, the result is a remapping of
the parallel variable into the shape of the index. Thus, in the assignment

dest = [index]source;

the parallel variable dest gets values from source; the values in index indicate
which element of source is to go to which element of dest. The variables
dest and index must be of the current shape; source can be of any shape. This
is known as a get operation. Putting the index variable on the left-hand side
specifies a send operation. Sends are roughly twice as fast as gets. The operations
can also be performed with the send and get functions in the C* communication
library.
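
The two directions differ only in which side the index is applied to. A minimal Python sketch follows; the function names get_op and send_op are hypothetical and do not reproduce the signatures of the C* library functions, and one-dimensional lists stand in for parallel variables.

```python
def get_op(source, index):
    """dest = [index]source: each destination position fetches source[index[i]]."""
    return [source[i] for i in index]

def send_op(source, index, dest_size):
    """[index]dest = source: each source position pushes its value to dest[index[i]]."""
    dest = [None] * dest_size
    for pos, target in enumerate(index):
        dest[target] = source[pos]
    return dest

source = ["a", "b", "c", "d"]
index = [3, 0, 2, 1]
fetched = get_op(source, index)
pushed = send_op(source, index, len(source))
```

When index is a permutation, as here, the send produces the inverse rearrangement of the get.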


10.4 Transforming Parallel Data 

C* provides operators and library functions that enable programmers to easily 
perform common transformations of parallel data. 

C* overloads the meaning of several Standard C compound assignment operators
to provide a succinct way of expressing global reductions of parallel data. For
example, +=, when applied as a unary operator to a parallel variable, sums the
values of all active elements of the parallel variable. The resulting value can be
treated the same way as the result of a serial operation. Similarly, the |= operator
performs a bitwise OR of all elements of a parallel variable. The reduce and
global library functions provide similar capabilities for various operations.

The C* communication library contains many functions that perform other
transformations of parallel data. For example:

■ The scan function calculates running results for various operations on a
parallel variable.

■ The spread function spreads the result of a parallel operation into
elements of a parallel variable.

■ The rank function produces a numerical ranking of the values of parallel
variable elements; this ranking can be used to rearrange the elements into
sorted order.
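
The reductions and transformations described above have simple serial analogues. The Python sketch below is illustrative only: the function names are hypothetical, the scan shown is an inclusive running sum (the library function offers more options than this), and lists stand in for parallel variables.

```python
from functools import reduce
from itertools import accumulate
from operator import or_

def reduce_sum(values, active):
    """Unary += reduction: sum of the values in active positions only."""
    return sum(v for v, on in zip(values, active) if on)

def reduce_or(values):
    """Unary |= reduction: bitwise OR of all values."""
    return reduce(or_, values, 0)

def scan_add(values):
    """Inclusive running sum, one flavor of scan."""
    return list(accumulate(values))

def rank(values):
    """Rank of each element: the position it would occupy in sorted order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

values = [5, 3, 8, 1]
ranks = rank(values)
# Using the ranks as send addresses rearranges the values into sorted order.
sorted_values = [None] * len(values)
for v, r in zip(values, ranks):
    sorted_values[r] = v
```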


The *Lisp language is a high-level programming language for the Connection
Machine system. Based on the Common Lisp programming language, *Lisp
allows you to write data parallel programs for the CM using the data types,
programming constructs, and programming style of Lisp. Programs written in *Lisp
make full use of CM hardware, yet at the same time retain the clarity,
expressiveness, and flexibility of Lisp.

The *Lisp language extends the Common Lisp language by providing parallel
equivalents for the basic operations of Common Lisp, along with operations that
are unique to data parallel programming, such as processor selection, parallel
prefix calculations, interprocessor communication, and data shape specification.

A *Lisp program is simply a Common Lisp program that includes calls to *Lisp
operators. A call to a *Lisp operator causes all active CM processors to execute
that operation in parallel. Thus, *Lisp is fully compatible with Common Lisp;
programs written in Common Lisp will run unmodified in *Lisp.

*Lisp functions and macros are defined via defun and defmacro, just as in 
Common Lisp. *Lisp programs are compiled by the *Lisp compiler, which 
includes (and is invoked in the same ways as) the Common Lisp compiler. This 
means that programs in *Lisp and Common Lisp can be written, compiled, and 
tested with the same editors and debuggers.


11.1 Structuring Parallel Data

Scalar and Parallel Data 

*Lisp is an extension of Common Lisp and therefore includes all the standard 
Common Lisp data types. These data types are collectively referred to as scalar 
data. *Lisp also supports an additional parallel data type, called a pvar. A pvar 
is a parallel variable, that is, a single variable with a separate, modifiable value 
in each processor of the CM. Operations performed on a pvar are performed 
simultaneously by all active CM processors, with each processor modifying only 
its own value for the pvar. Many of the scalar data types in Common Lisp have 
corresponding pvar equivalents. The eight basic pvar data types are boolean, 
integer, floating-point, complex, character, array, structure, and front-end value. 


Creating Pvars in *Lisp

There are three basic ways to create, or allocate, a pvar in *Lisp, each designed
to serve a specific purpose, as shown in the examples below:

(!! 5)                        ;; Allocating a temporary pvar

(defpvar my-five-pvar 5)      ;; Allocating a permanent pvar

(*let ((my-pi!! pi))          ;; Allocating a local pvar
  (*!! 2 my-pi!!))

As these examples show, *Lisp supports temporary, permanent, and local pvars.

■ Temporary pvars are allocated by the !! (bang-bang) function, which
takes a single scalar value as its argument and returns a temporary pvar
with that value in every processor.

■ Local pvars are allocated by the *let and *let* functions. They exist for
the duration of a body of *Lisp code.

■ Permanent pvars are allocated by the defpvar function. They remain in
existence until specifically deallocated.


Defining the Shape of the Data

The shape of the data stored in a pvar is determined by a grid of processors that
the CM is currently simulating. The defining property of a processor grid is its
geometry: the rank of the simulated grid and the sizes of its dimensions.

The combination of a particular grid geometry and a set of pvars that share that 
geometry is called a virtual processor set (VP set). For example, the expression 

(def-vp-set my-vp-set '(64 64) 

:*defvars ((x 1 nil fixnum-pvar) 

(y 1.0 nil single-float-pvar))) 

defines a VP set named my-vp-set with 64 x 64 processors and associates two 
permanent pvars with it: an integer pvar x and a single-precision floating-point 
pvar y. 

Because the CM can simulate many grids within a single program, *Lisp uses the
concept of a current VP set to determine which VP set is active. Unless otherwise
specified, all pvar operations take place within the current VP set. If no VP set
has been defined, all pvar operations occur within a default VP set that is
automatically defined whenever *Lisp starts up.


Processor Addressing 

An important feature of the simulated grids defined by VP sets is that they permit 
the assignment of addresses to processors. There are two basic methods used to 
assign addresses to processors on the CM: send addressing and grid addressing. 

Each processor has a unique numeric send address based upon its location within
the physical hardware, accessible via the *Lisp operation (self-address!!).

Each processor also has a grid address, a sequence of coordinates that defines its
position in the n-dimensional grid of processors the CM is currently simulating.
The *Lisp operation (self-address-grid!! n) returns a pvar whose value
in each processor is the coordinate of that processor along the nth dimension of
the current grid.


Accessing and Copying Parallel Data

*Lisp allows you to access pvar values on a per-processor basis, to copy the 
value of one pvar into another, and to display the elements of a pvar over a range 
of processors. For example: 

■ (pref my-pvar 10) returns the value of my-pvar in processor 10.

■ (*setf (pref my-pvar 10) 123) stores the quantity 123 into
processor 10 of my-pvar.

■ (*setf (pref my-pvar (cube-from-grid-address 5 7)) 111)
stores 111 into my-pvar at grid location (5,7).

■ (*set pvar1 pvar2) copies the contents of pvar2 into pvar1 in all
active processors.

■ (*set pvar1 5) stores the value 5 into pvar1 in all active processors.

The *Lisp operation ppp (short for pretty-print-pvar) displays the values 
of a pvar. For example, the expression 

(ppp (self-address!!) :end 20) 

displays the send addresses of the first 20 processors: 

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 


11.2 Computing in Parallel

The parallel operations supplied by *Lisp are modeled very closely on the
existing scalar operations of Common Lisp and include parallel equivalents for most
Common Lisp functions and macros. These parallel operations typically have the
same name as their scalar Common Lisp counterparts, with either the characters
“!!” added to the end or an asterisk added to the front. The characters
“!!” are meant to resemble the mathematical symbol ∥, which means parallel.
The asterisk similarly denotes the concept of an operation taking place in
parallel. For example, the parallel version of the Common Lisp mod function is
mod!!, and the Common Lisp if operator has two *Lisp equivalents, if!! and
*if.

Most *Lisp operators take pvars as arguments and return a pvar result. In general,
if a Common Lisp operation takes arguments of a specific data type, the *Lisp
equivalent for that operation takes pvars of that data type as arguments and
returns an appropriately typed pvar result.

For example, the functions +!!, -!!, *!!, and /!! perform the same operations
as the Common Lisp functions +, -, *, and /, but take numeric pvars as
arguments and perform the appropriate arithmetic operation in parallel. The *Lisp
expression

(*set pvar2 (+!! pvar1 (*!! pvar1 pvar2)))

multiplies the values of pvar1 and pvar2 in all active processors, adds the value
of pvar1, then stores the result in pvar2.

*Lisp includes parallel versions of Common Lisp functions for many data types, 
including operations for complex and character pvars. *Lisp also includes an 
extensive selection of operators for manipulating array, vector, string, sequence, 
and structure pvars. There are even operations that allow you to create pvars that 
reference front-end data structures (such as symbols and lists). 

In addition, *Lisp redefines many Common Lisp operations so that they will
accept pvar arguments and will call the appropriate *Lisp operations to compute
the result. This means that the above *set example can be rewritten as:

(*set pvar2 (+ pvar1 (* pvar1 pvar2)))


Selection of Active Sets of Processors 

Parallel computations can be performed in all processors simultaneously, or in 
a specific subset of active processors selected by the user. Pvar values in inactive 
processors are not changed. *Lisp provides several macros for selecting the 
current set of active processors (sometimes referred to as the currently selected 
set). 

The most basic processor selection operators are *when and *unless. Similar
to their Common Lisp counterparts, these operators conditionally evaluate a
body of code based on the result of a test. The difference is that the test controls
which processors will evaluate the code, not whether the code will be evaluated
at all. In the following code sample, *when is used to select all processors with
odd send addresses. The value of my-pvar in those processors is then negated.

(*set my-pvar (self-address!!))

(*when (oddp!! (self-address!!))
  (*set my-pvar (-!! my-pvar)))

(ppp my-pvar :end 19)

0 -1 2 -3 4 -5 6 -7 8 -9 10 -11 12 -13 14 -15 16 -17 18

The *all construct unconditionally selects all processors for the duration of a 
body of *Lisp code. For example, evaluating the expression 

(*all (*set my-pvar 10)) 

ensures that the value of my-pvar in all processors is 10, regardless of the state 
of the currently selected set. 
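
The *when example can be mimicked serially to make the selection step explicit. This Python sketch is an emulation, not *Lisp: the function names are hypothetical, a list stands in for the pvar, and list indices stand in for send addresses.

```python
def self_address(n):
    """Stand-in for (self-address!!): each position holds its own address."""
    return list(range(n))

def when_negate_odd(pvar):
    """Negate the value only in positions whose address is odd."""
    addresses = self_address(len(pvar))
    active = [a % 2 == 1 for a in addresses]   # the test selects processors
    return [-v if on else v for v, on in zip(pvar, active)]

my_pvar = when_negate_odd(self_address(19))
print(*my_pvar)
# 0 -1 2 -3 4 -5 6 -7 8 -9 10 -11 12 -13 14 -15 16 -17 18
```

Positions that fail the test keep their previous values, which matches the statement that pvar values in inactive processors are not changed.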


11.3 Communicating in Parallel 

Like all CM languages, *Lisp supports both regular and irregular
communication. For example:

■ news!! causes each active processor to get a value from another
processor a fixed distance away on the grid.

■ *news causes each active processor to send a value to another processor
a fixed distance away on the grid.

■ pref!! allows each active processor to get a value from any other
processor in the grid.

■ *pset allows each active processor to send a value to any other processor
in the grid.

If two or more processors attempt to read the data of a single processor, they all
receive the same correct data. If two or more processors attempt to write to the
same address, the user can specify how the values are to be combined (for
instance, by summing them).
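
Combining on collision can be pictured with a serial sketch. The Python below is illustrative and does not reproduce the *pset argument list: send_combine is a hypothetical name, and when no combiner is given the last writer wins, which is an arbitrary choice made only for this example.

```python
from operator import add

def send_combine(values, addresses, size, combine=None):
    """Send values[i] to dest[addresses[i]], combining values that collide."""
    dest = [None] * size                 # None marks addresses nobody wrote to
    for v, a in zip(values, addresses):
        if dest[a] is None or combine is None:
            dest[a] = v                  # first arrival, or plain overwrite
        else:
            dest[a] = combine(dest[a], v)
    return dest

values = [1, 2, 3, 4]
addresses = [0, 2, 2, 0]
summed = send_combine(values, addresses, 3, combine=add)
largest = send_combine(values, addresses, 3, combine=max)
```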


11.4 Transforming Parallel Data

*Lisp contains many functions to help perform transformations on data. These
include operators computing parallel prefixes (scanning) of data, spreading data
across the processors of the CM, and sorting and enumeration of pvar values.
Some examples:

■ scan!! and segment-set-scan!! permit the selection of many kinds
of scanning operations, such as addition/multiplication of values; taking
the maximum and minimum of values; taking the logical/arithmetic AND,
OR, and XOR of values; and even simply copying values across the
processor grid.

The scan!! operation accepts a segmentation argument for simple uses
of this feature. The segment-set-scan!! operation uses a special type
of pvar, a segment set pvar, to allow much finer control over the
segmentation of processors than scan!! provides.

■ spread!! replicates the value of a pvar at a given coordinate to all
processors along a selected dimension of the currently selected grid. A related
operation, reduce-and-spread!!, combines the operations of scanning
and spreading.

■ The sort!! operator reorders the values of a numeric pvar into ascending
order.

■ The enumerate!! operator assigns to each currently active processor a
distinct integer between 0 (inclusive) and the number of active processors
(exclusive).
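
Segmented scanning and enumeration can be sketched serially as follows. This is a Python illustration, not *Lisp: the function names are hypothetical, the scan is an inclusive sum, a flag of True marks the first position of each segment (an assumed convention), and the enumeration numbers active positions in address order although the text only promises distinct integers.

```python
def segmented_scan_add(values, segment_start):
    """Inclusive running sum that restarts wherever segment_start is True."""
    out, running = [], 0
    for v, start in zip(values, segment_start):
        running = v if start else running + v
        out.append(running)
    return out

def enumerate_active(active):
    """Give each active position a distinct integer in [0, number of active)."""
    out, counter = [], 0
    for on in active:
        if on:
            out.append(counter)
            counter += 1
        else:
            out.append(None)     # inactive positions receive nothing
    return out

scanned = segmented_scan_add([1, 2, 3, 4, 5, 6],
                             [True, False, False, True, False, False])
numbered = enumerate_active([True, False, True, True])
```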


The Connection Machine Scientific Software Library (CMSSL) is a rapidly
growing set of numerical routines that support computational applications while 
exploiting the massive parallelism of the Connection Machine system. The 
CMSSL provides data parallel implementations of familiar numerical routines, 
offering new solutions for performance optimization, algorithm choice, and 
application design. CMSSL routines have been designed for users of languages 
with array syntax (for example, CM Fortran, High Performance Fortran, and C*). 

The CMSSL includes routines for solving linear algebraic equations, solving 
ordinary and partial differential equations, signal processing, statistical analysis, 
and optimization. The library also provides a set of communication functions
that offer a strong base for the development of computational tools. These
functions support computations on problems represented by both structured and
unstructured grids. For computations on unstructured grids, the CMSSL offers
routines for efficient load balancing of both arithmetic and communication. 


12.1 Overview 

The current version of the CMSSL concentrates on six critical areas of scientific
programming:

■ numerical linear algebra

    ■ matrix operations on dense, grid sparse, and arbitrary sparse
    matrices

    ■ linear equation solvers for dense, banded, and sparse systems of
    equations

    ■ eigensystem analysis of dense symmetric, tridiagonal, and sparse
    systems

■ Fourier Transforms (complex-to-complex, real-to-complex, and
complex-to-real)

■ ordinary differential equations

■ optimization

■ random number generation

■ statistical analysis

The library also includes optimized communication functions important to
structured and unstructured grid computations for the solution of partial
differential equations and optimization problems:

■ polyshift

■ all-to-all broadcast and reduction

■ matrix transpose

■ gather and scatter

■ partitioning

■ communication compiler


12.2 Multiple Instances 

Most CMSSL linear algebra routines are designed to support multiple instances.
They allow multiple independent matrices to be solved, transformed, or
multiplied concurrently. In addition, they allow multiple vectors or multiple
right-hand sides, where relevant, to be associated with each matrix to be
multiplied or solved. The difference between invoking computation on a single
instance and on multiple instances lies only in the dimensionality and layout of
the data structures used as parameters to the particular CMSSL routine.

As an example, consider the linear equation solvers for banded systems. For the
tridiagonal case, the parameters to these routines include three vectors that
contain the upper, main, and lower diagonals of a tridiagonal system, and a fourth
vector that contains the right-hand-side values for the system. Upon completion
the solution overwrites the right-hand side. One routine interface supports four
different degrees of computational concurrency:

■ A single system may be solved.

■ A single system may be solved for multiple right-hand sides.

■ Multiple systems may be solved for a single right-hand side each.

■ Multiple systems may be solved, each for multiple right-hand sides.

To solve a single system, one specifies the upper, main, and lower diagonal
arguments as one-dimensional (see Figure 19).



Figure 19. A single tridiagonal system with a single right-hand side. 


To solve for multiple right-hand sides, one gives the right-hand-side argument
(which will be replaced by the solutions) an in-processor (serial) dimension
equal to the number of right-hand sides (nrhs) (see Figure 20).

To solve multiple systems, one specifies the upper, main, and lower arguments
with two dimensions: one for the coefficients of the system and one to specify
how many systems are represented. The right-hand side (solution) argument is
similarly specified in two dimensions (see Figure 21).



Figure 21. Multiple tridiagonal systems with a single right-hand side for each system.


To solve multiple systems each with multiple right-hand sides, one specifies the
right-hand-side (solution) argument in three dimensions: one is the length of the
vector, and along this dimension lie the right-hand-side values; one is the number
of systems (n); and one is the number of right-hand sides (nrhs) per system (see
Figure 22).



Figure 22. Multiple tridiagonal systems with multiple right-hand sides for each system.


The benefit of using CMSSL routines to solve a single instance of a linear
problem lies in the speed gained by exploiting the parallel architecture of the
Connection Machine system. Computations on matrices require numerous
repetitive calculations along one or both axes. On a serial machine, these must be done
one at a time, but on a parallel machine they can be done all at once.

Using CMSSL to solve multiple instances of a linear problem offers similar, but 
perhaps greater, benefits. For applications that require solving many systems or 
decomposing many matrices, it is no longer necessary to iterate over the set of 
systems; the solutions can be computed concurrently. 
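
The four degrees of concurrency can be illustrated with a small serial sketch. The Python below is not the CMSSL interface: the function names and argument layout are hypothetical, the single-system solver is the textbook Thomas algorithm without pivoting, and the loops over systems and right-hand sides mark exactly the work that a multiple-instance routine would perform concurrently.

```python
def solve_tridiagonal(lower, main, upper, rhs):
    """Thomas algorithm for one system; lower[0] and upper[-1] are unused."""
    n = len(main)
    c = [0.0] * n                        # modified upper diagonal
    d = [0.0] * n                        # modified right-hand side
    c[0] = upper[0] / main[0]
    d[0] = rhs[0] / main[0]
    for i in range(1, n):                # forward elimination
        denom = main[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):       # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def solve_many(systems, rhs_sets):
    """systems[s] = (lower, main, upper); rhs_sets[s] = list of right-hand sides."""
    return [[solve_tridiagonal(lo, ma, up, b) for b in rhs_list]
            for (lo, ma, up), rhs_list in zip(systems, rhs_sets)]

system = ([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0])
solutions = solve_many([system, system],
                       [[[0.0, 0.0, 4.0]],
                        [[1.0, 0.0, 1.0], [0.0, 0.0, 4.0]]])
```

Here the first system is solved for one right-hand side and the second for two; only the shape of the arguments changes between the cases.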


12.3 Matrix Operations

Basic Linear Algebra Routines for Dense Matrices 

The basic linear algebra routines for dense matrices perform the operations listed
below. For some operations (inner product, outer product, matrix vector
multiplication, vector matrix multiplication, and matrix multiplication), the library
includes a family of related routines, each performing a variation on the basic
operation. For example, some routines overwrite the supplied destination with
the results of the operation; others add the results to values you supply; and some
take the transpose, conjugate, or Hermitian of one or more operands.

■ Inner Product Routines. Compute the inner products of one or more pairs
of vectors, or the global inner product over all axes of two arrays. For
complex data, you can conjugate the first operand vector.

■ 2-Norm Routine. Computes the 2-norms of one or more real or complex
vectors, or the global 2-norm of a real or complex array.

■ Outer Product Routines. Compute the outer products of one or more pairs
of vectors. For complex data, you can conjugate the second operand
vector.

■ Matrix Vector and Vector Matrix Multiplication Routines. Compute one or
more matrix vector (or vector matrix) products. For complex data, you can
conjugate the matrix.

■ Matrix Multiplication Routines. Compute one or more matrix products.
Variants of the basic routines take the transpose of one or both operand
matrices before computing the product; for complex data, you can take the
Hermitian of either operand.

■ Infinity Norm Routine. Estimates the infinity norms of one or more
matrices.

■ Matrix Multiplication Routine with External Storage. Computes a matrix
product, where one matrix is too large to fit into core memory and is stored
in a file.


Basic Linear Algebra Routines for Sparse Matrices

The CMSSL provides routines for basic linear algebra operations on sparse 
matrices representing structured and unstructured grids. Both elementwise and 
block sparse matrices are supported. The following operations are provided in 
each of these categories: 

■ Sparse matrix × vector

■ Vector × sparse matrix

■ Sparse matrix × dense matrix

■ Dense matrix × sparse matrix

Future versions of the CMSSL will introduce most of the other basic linear
algebra operations for arbitrary and grid sparse matrices.

The primary intent of the arbitrary sparse matrix operations is to provide the
basic building blocks for more complex sparse applications: for example, a
sparse iterative solver, or computation of the eigenvalues of sparse matrices by
the Lanczos or the Arnoldi method.

For applications that do not perform explicit sparse linear algebra operations, but
want to make use of some communication primitives used by the sparse basic
linear algebra functions, the CMSSL provides two utility functions: the sparse
gather utility and the sparse scatter utility (described in Section 12.11). These
utilities are intended for use in applications such as the solution of partial
differential equations on unstructured discretizations, and optimization problems
represented by sparse matrices occurring in network flow problems. A
communication compiler and a partitioning routine are also provided (see Section
12.11).

Two separate storage representations of arbitrary sparse matrices are supported.
These data mappings are referred to as the elementwise sparse matrix mapping
and the block sparse matrix mapping. In the elementwise data mapping, the zero
data values of the matrix are ignored and the non-zero data values are stored
rowwise. In the block sparse mapping, the sparse matrix is stored as a collection of
dense block matrices. In its full matrix representation, this block matrix storage
scheme is extremely flexible. The dense blocks need not be composed of
contiguous rows and columns, and may overlap in any way. One possible application
for the block sparse representation is the finite element method. Structured finite
element grids lead to a grid block sparse data layout; unstructured grids result in
arbitrary block sparse layout.
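
The idea of storing only the non-zero values rowwise can be illustrated with a small sketch. The Python below uses a compressed-row layout (values, column indices, and row start offsets); this layout and the function names are assumptions chosen for illustration, and the actual CMSSL data mapping may differ.

```python
def to_rowwise(dense):
    """Keep only non-zero entries, row by row, with their column indices."""
    values, cols, row_start = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                cols.append(j)
        row_start.append(len(values))    # where the next row begins
    return values, cols, row_start

def sparse_matvec(values, cols, row_start, x):
    """Sparse matrix × vector using the rowwise representation."""
    y = []
    for r in range(len(row_start) - 1):
        begin, end = row_start[r], row_start[r + 1]
        y.append(sum(values[k] * x[cols[k]] for k in range(begin, end)))
    return y

dense = [[4, 0, 0],
         [0, 0, 2],
         [1, 0, 3]]
values, cols, row_start = to_rowwise(dense)
y = sparse_matvec(values, cols, row_start, [1, 2, 3])
```

Only four of the nine entries are stored and touched, which is the saving that the elementwise mapping is meant to provide.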

The grid sparse matrix routines operate on data from grid-based applications.
Coefficient matrix elements residing at each grid point P are multiplied by vector
or matrix elements residing at point P and its nearest-neighbor points. The result
is placed in product vector or matrix elements residing at point P. These routines
support multiple instances and block matrices.

The arbitrary and grid sparse matrix functions provide performance
optimizations based on the premise that applications will use these sparse functions
repeatedly. A marginal setup cost can be incurred before the first call to the
sparse functions. The setup cost is then amortized over several calls to the sparse
matrix functions.


12.4 Linear Algebraic Equations 

Dense Systems of Equations 

The CMSSL includes both the LU and the QR factorization methods for the
solution of one or more instances of dense linear algebraic equations:

■ LU Factorization. These routines use Gaussian elimination (with or
without partial pivoting) to factor one or more instances of an n × n matrix A
into a lower triangular matrix L and an upper triangular matrix U, A=LU.

■ LU Solution Routines. These routines use the triangular factors L and U
produced by the LU factorization routines to produce solutions to the
systems LUX=B or (LU)^T X=B. B may represent one or more right-hand sides
for each instance of the systems of equations.

■ QR Factorization. These routines use Householder transformations (with
or without pivoting) to factor one or more instances of an m × n matrix A,
m>n, into a trapezoidal matrix Q and an upper triangular matrix R, A=QR.

■ QR Solution Routines. These routines use the Q and R factors produced by
the QR factorization routines to solve one or more instances of the systems
of equations QRX=B or (QR)^T X=B. B may represent one or more
right-hand sides for each instance of the systems of equations.

■ Triangular System Solvers. These routines use the factors produced by the
LU and QR factorization routines to solve triangular systems of equations
(trapezoidal systems for Q). Thus, the CMSSL includes routines for
solution of one or more instances of triangular systems of equations of the
form LX=B, L^T X=B, UX=B, U^T X=B, RX=B, and R^T X=B. B may represent
one or more right-hand sides for each instance of the systems of equations.

The CMSSL also contains routines for the solution of one or more
instances of trapezoidal systems of equations QX=B or Q^T X=B, where
again B represents one or more right-hand sides for each instance of the
systems of equations.

■ Gauss-Jordan System Solver. This routine solves (with partial or total 
pivoting) a system of equations of the form AX=B using a version of 
Gauss-Jordan elimination. B represents one or more right-hand sides. 

■ Matrix Inversion. This routine inverts a square matrix A using the Gauss- 
Jordan routine. 

■ Utility Routines. The CMSSL also provides a set of utility routines 
associated with the factorization routines. For example, there are routines 
that explicitly compute L, U, and R from the representation used internally 
in the factorization routines. For QR factorization there are also routines 
for extracting the diagonal and computing the infinity norm. 

■ LU Factorization and Solution with External Storage. These routines use
block Gaussian elimination with partial pivoting to reduce a matrix A to
triangular form and solve the system AX=B, where A is too large to fit into
core memory and is stored in a file.

■ QR Factorization and Solution with External Storage. These routines use 
block Householder reflections to perform the factorization A=QR and 
solve the system AX=B, where A is too large to fit into core memory and 
is stored in a file. 
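
The split between a factorization routine and a solution routine can be sketched as follows. This Python is a textbook Gaussian elimination with partial pivoting for a single small matrix, not the CMSSL routines: the function names are hypothetical, L and U are stored together in one array (an assumed internal representation), and the row permutation is returned separately.

```python
def lu_factor(a):
    """Factor a square matrix with partial pivoting; returns (lu, perm)."""
    n = len(a)
    lu = [row[:] for row in a]           # work on a copy
    perm = list(range(n))                # row i of lu came from row perm[i] of a
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[pivot][k] == 0:
            raise ValueError("matrix is singular")
        if pivot != k:
            lu[k], lu[pivot] = lu[pivot], lu[k]
            perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]         # multiplier, stored in the L part
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm

def lu_solve(lu, perm, b):
    """Solve A x = b for one right-hand side using the stored factors."""
    n = len(lu)
    y = [b[perm[i]] for i in range(n)]   # apply the row permutation to b
    for i in range(n):                   # forward substitution with unit L
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    x = y[:]
    for i in range(n - 1, -1, -1):       # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x

lu, perm = lu_factor([[0.0, 2.0], [1.0, 1.0]])
xs = [lu_solve(lu, perm, b) for b in ([4.0, 3.0], [2.0, 1.0])]
```

Factoring once and then solving for each right-hand side reuses the expensive step, which is the reason for keeping factorization and solution as separate routines.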


Banded Systems of Equations 

Banded linear system solvers solve systems of equations in which the non-zero
elements of the coefficient matrix lie in a narrow band around the diagonal. The
CMSSL provides routines for solving tridiagonal, pentadiagonal, block
tridiagonal, and block pentadiagonal systems of equations. Each routine solves multiple
systems of equations, each with one or more right-hand sides, for both real and
complex data types. A choice of algorithms is offered.

The multiple-instance capability of the banded system routines in CMSSL is
particularly useful in connection with Alternating Direction Methods. You can
specify the axis along which the systems are to be solved. No data reordering or
transposition is necessary for the solution of systems along any axis.

The CMSSL includes three types of routines for the solution of banded systems 
of equations using direct methods: 

■ Factorization of Banded Systems. These routines support five different 
techniques for the factorization of one or more instances of tridiagonal, 
block tridiagonal, pentadiagonal, or block pentadiagonal matrices A: 

■ pipelined Gaussian elimination (with or without partial pivoting) 

■ substructuring with pipelined Gaussian elimination 

■ substructuring with cyclic reduction 

■ substructuring with balanced cyclic reduction 

■ substructuring with transpose 

■ Solution of Factored Banded Systems. These routines use the factors produced 
by the factorization routines to solve one or more instances of 
systems of equations of the form LUX=B, where L and U are lower and 
upper (respectively) bidiagonal or block bidiagonal, or lower and upper 
(respectively) tridiagonal or block tridiagonal matrices, or permutations 
thereof. B represents one or more right-hand sides for each system of 
equations. 

■ Factorization and Solution of Banded Systems. These routines support the 
same techniques as above for the solution of one or more instances of 
tridiagonal, block tridiagonal, pentadiagonal, or block pentadiagonal 
systems of equations AX=B. B represents one or more right-hand sides for 
each system of equations. 
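
For a single tridiagonal system, Gaussian elimination without pivoting reduces to a short forward-elimination and back-substitution recurrence. The Python sketch below shows that recurrence and a trivial multiple-instance wrapper. The names solve_tridiagonal and solve_many are hypothetical, the code is serial, and the pipelined and substructured variants listed above are not shown.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve one tridiagonal system by Gaussian elimination without pivoting.

    lower[i] multiplies x[i-1] (lower[0] is unused), diag[i] multiplies x[i],
    and upper[i] multiplies x[i+1] (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination removes the sub-diagonal.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_many(systems):
    """Multiple-instance form: each item is a (lower, diag, upper, rhs) tuple."""
    return [solve_tridiagonal(*s) for s in systems]
```

Because the instances are independent, a loop such as solve_many is the serial counterpart of solving many systems at once along one array axis.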


Sparse Systems of Equations 

The CMSSL includes routines for the solution of sparse systems of equations by 
iterative techniques. Several standard sparse iterative solvers are provided: 
Conjugate Gradient (CG), Bi-Conjugate Gradient with Stabilization 
(BI-CG-STAB), Quasi-Minimal Residual (QMR), and restarted Generalized 
Minimal Residual (GMRES). 
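
To make the idea of an iterative sparse solver concrete, here is a minimal Conjugate Gradient sketch in Python for a symmetric positive definite matrix. The sparse storage (a list of (column, value) pairs per row) and the function names are assumptions chosen for brevity; they do not reflect the CMSSL calling sequence.

```python
def sparse_matvec(rows, x):
    """rows[i] is a list of (column, value) pairs for the non-zeros of row i."""
    return [sum(v * x[j] for j, v in row) for row in rows]


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the initial guess x = 0
    p = list(r)                 # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(v * v for v in r)
        # The new direction is conjugate to the previous ones.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

The solver touches the matrix only through the matrix-vector product, which is why this family of methods suits sparse operators.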


12.5 Eigensystem Analysis 


Dense Systems 

The CMSSL provides a variety of routines for performing eigensystem analysis 
of dense systems: 

■ Eigensystem Analysis of Dense Hermitian Matrices. Computes the eigenvalues 
and eigenvectors of one or more dense real symmetric or complex 
Hermitian matrices. 

■ Eigensystem Analysis using Jacobi Rotations. Computes the eigenvalues 
and eigenvectors of one or more dense real symmetric matrices using 
Jacobi rotations. 

■ Selected Eigenvalue and Eigenvector Analysis Using a k-Step Lanczos 
Method. Finds selected solutions $\{\lambda, x\}$ to the real standard or generalized 
eigenvalue problem $Lx = \lambda Bx$. B can be positive semi-definite and is the 
identity for the standard eigenproblem. The operator L is dense, real, and 
symmetric with respect to B; that is, $BL = L^T B$. The algorithm used is a 
k-step Lanczos algorithm with implicit restart. 

■ Selected Eigenvalue and Eigenvector Analysis Using a k-Step Arnoldi 
Method. Finds selected solutions $\{\lambda, x\}$ to the real standard or generalized 
eigenvalue problem $Lx = \lambda Bx$. B can be positive semi-definite and is the 
identity for the standard eigenproblem. The operator L is dense and real. 
The algorithm used is a k-step Arnoldi algorithm with implicit restart. 

■ Generalized Eigensystem Analysis of Dense Symmetric Matrices. Solves 
the generalized eigenvalue problem $Ax = \lambda Bx$, where A is dense, real, and 
symmetric, and B is positive definite. 

■ Reduction to Tridiagonal Form and Corresponding Basis Transformation. 
These routines reduce one or more dense real symmetric or complex Hermitian 
matrices to real symmetric tridiagonal form using Householder 
transformations. After this reduction occurs, for each instance, you can 
transform the coordinates of an arbitrary set of vectors from the basis 
of the original Hermitian matrix to that of the tridiagonal matrix, or vice 
versa. 


Tridiagonal Systems 

The CMSSL also includes routines for performing eigensystem analysis of real 
symmetric tridiagonal systems: 

■ Eigenvalues of Real Symmetric Tridiagonal Matrices. Computes the 
eigenvalues of one or more real symmetric tridiagonal matrices using a 
parallel bisection algorithm. 

■ Eigenvectors of Real Symmetric Tridiagonal Matrices. Computes the 
eigenvectors corresponding to a given set of eigenvalues for one or more 
real symmetric tridiagonal matrices, using an inverse iteration algorithm. 
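
The bisection idea can be sketched in a few lines of Python: a Sturm-sequence count gives the number of eigenvalues below any trial value, and bisection on that count isolates the k-th eigenvalue. Each eigenvalue is found independently of the others, which is what makes the method attractive for parallel execution. The function names and the serial formulation are assumptions of this sketch.

```python
def count_below(diag, off, x):
    """Number of eigenvalues of the symmetric tridiagonal matrix that are < x.

    diag holds the n diagonal entries, off the n-1 off-diagonal entries.
    """
    count, q = 0, 1.0
    for i, a in enumerate(diag):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        q = a - x - b2 / q
        if q == 0.0:
            q = 1e-30  # nudge off zero to avoid dividing by zero next step
        if q < 0.0:
            count += 1
    return count


def eigenvalue(diag, off, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0 is the smallest), found by bisection."""
    n = len(diag)
    # Gershgorin bounds enclose every eigenvalue.
    radius = [(abs(off[i - 1]) if i > 0 else 0.0) + (abs(off[i]) if i < n - 1 else 0.0)
              for i in range(n)]
    lo = min(a - r for a, r in zip(diag, radius))
    hi = max(a + r for a, r in zip(diag, radius))
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if count_below(diag, off, mid) > k:
            hi = mid   # at least k+1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```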


Sparse Systems 

The Lanczos and Arnoldi eigensystem analysis routines described above also 
apply to sparse systems. 


12.6 Fourier Transforms 

The CMSSL offers routines for the computation of Fourier Transforms by 
Cooley-Tukey type algorithms on one or more axes of arrays with an arbitrary 
number of axes. Fast Fourier Transforms (FFTs) have a wide range of scientific 
and engineering applications including digital filtering of discrete signals, 
smoothing and decomposition of optical images, correlation and autocorrelation 
of data series, numerical solution of partial differential equations such as 
Poisson’s equation, and polynomial multiplication. 

The CMSSL provides the following FFT routines: 

■ Simple Complex-to-Complex FFT. Performs a complex-to-complex Fast 
Fourier Transform in the same direction along all axes of a data set. 

■ Detailed Complex-to-Complex FFT. Allows separate specification of the 
transform direction, scaling factor, and addressing mode along each data 
axis in a complex-to-complex FFT. Supports multiple instances. 

■ Detailed Real-to-Complex and Complex-to-Real FFT. The real-to-complex 
FFT computes the Fourier transform of real data; the complex-to-real 
FFT transforms conjugate symmetric sequences. These operations allow 
separate specification of the transform direction, scaling factor, and 
addressing mode along each data axis; they also support multiple 
instances. 

■ Array Conversion Utilities. These utilities convert real arrays into complex 
arrays suitable for input to the real-to-complex FFT, and convert 
complex arrays (supplied in the format produced by the complex-to-real 
FFT) to real arrays. 
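
A Cooley-Tukey algorithm splits a transform of length n into two transforms of length n/2 and combines them with "twiddle" factors. The recursive radix-2 Python sketch below shows a one-dimensional complex-to-complex transform with a selectable direction. It is an illustration only: the function name fft, the power-of-two restriction, and the choice to leave scaling to the caller are assumptions of the sketch.

```python
import cmath


def fft(x, inverse=False):
    """Unscaled radix-2 Cooley-Tukey transform of a sequence of complex numbers.

    The length must be a power of two. For the inverse direction the caller
    divides the result by len(x) to recover the original data.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor exp(-+2*pi*i*k/n) applied to the odd half.
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

A multi-axis transform can be obtained by applying the same one-dimensional transform along each axis in turn.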


12.7 Ordinary Differential Equations 

The initial value problem for a system of N coupled first-order ordinary 
differential equations (ODEs) 

$$\frac{dy_i(x)}{dx} = f_i(x, y_1, \ldots, y_N), \qquad i = 1, \ldots, N \tag{1}$$

consists of finding the values $y_i(x_1)$ at some value $x_1$ of the independent variable 
x, given the values $y_i(x_0)$ of the variables at $x_0$. The CMSSL provides a routine 
that solves the initial value problem by integrating explicitly the set of equations 
(1) using a fifth-order Runge-Kutta-Fehlberg formula. The control of the step 
size during the integration is automatic. The evaluation of the right-hand side and 
possibly the scaling array for accuracy control are provided by the user through 
a reverse communication interface. 
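
The following Python sketch shows how an explicit Runge-Kutta-Fehlberg step with automatic step-size control can be organized. It uses the classical Fehlberg 4(5) coefficients and advances with the fifth-order result. The right-hand side is passed as an ordinary callback here, which is an assumption made for brevity; the function names, the tolerance handling, and the step-size limits are likewise illustrative and are not taken from the CMSSL.

```python
def rkf45_step(f, x, y, h):
    """One Fehlberg step for y' = f(x, y). Returns (fifth-order y, error estimate)."""
    def add(*terms):
        # y + h * sum(coefficient * k), component by component
        return [yi + h * sum(c * k[i] for c, k in terms) for i, yi in enumerate(y)]

    k1 = f(x, y)
    k2 = f(x + h / 4, add((1 / 4, k1)))
    k3 = f(x + 3 * h / 8, add((3 / 32, k1), (9 / 32, k2)))
    k4 = f(x + 12 * h / 13,
           add((1932 / 2197, k1), (-7200 / 2197, k2), (7296 / 2197, k3)))
    k5 = f(x + h,
           add((439 / 216, k1), (-8, k2), (3680 / 513, k3), (-845 / 4104, k4)))
    k6 = f(x + h / 2,
           add((-8 / 27, k1), (2, k2), (-3544 / 2565, k3),
               (1859 / 4104, k4), (-11 / 40, k5)))
    y4 = add((25 / 216, k1), (1408 / 2565, k3), (2197 / 4104, k4), (-1 / 5, k5))
    y5 = add((16 / 135, k1), (6656 / 12825, k3), (28561 / 56430, k4),
             (-9 / 50, k5), (2 / 55, k6))
    return y5, max(abs(a - b) for a, b in zip(y4, y5))


def integrate(f, x0, y0, x1, tol=1e-8):
    """Integrate from x0 to x1 (x1 > x0), adapting the step to keep the error below tol."""
    x, y = x0, list(y0)
    h = (x1 - x0) / 10
    while x < x1:
        h = min(h, x1 - x)
        y_new, err = rkf45_step(f, x, y, h)
        if err <= tol:              # accept the step
            x, y = x + h, y_new
        # Standard step-size update with a safety factor, clamped to [0.2, 2].
        scale = 0.9 * (tol / err) ** 0.2 if err > 0 else 2.0
        h *= min(2.0, max(0.2, scale))
    return y
```

A rejected step is simply retried with a smaller h, so the caller never chooses the step size directly.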


12.8 Optimization 

The CMSSL provides a routine that solves multidimensional minimization problems 
using the simplex linear programming method. The goal is to find the 
minimum of a linear function of multiple independent variables. In the standard 
formulation, the problem is to minimize the inner product $c^T x$ subject to the 
conditions $Ax = b$, $0 \le x \le u$, where A is a matrix, c is a coefficient vector, and $c^T x$ 
is referred to as the cost. The upper bound vector u may be infinite in one or more 
components. 

The simplex routine’s reverse communication interface allows you to reinvert 
(reset the matrix values and restart the routine) when numerical errors accumulate. 
You can fine-tune the frequency of reinversion and set a tolerance for 
degeneracy using input arguments. 


12.9 Random Number Generation 

Two varieties of random number generators (RNG) are included in the CMSSL: 

■ Fast RNG 

■ VP RNG 

These random number generators use a lagged-Fibonacci algorithm to produce 
a uniform distribution of random values. This implementation has been subjected 
to a battery of statistical tests, both on the stream of values within each processor 
and for cross-processor correlation. The only test that the CMSSL RNGs fail is the 
Birthday Spacings Test, as predicted by Marsaglia. Despite this failure, these 
lagged-Fibonacci RNGs are recommended for the most rigorous applications, 
such as Monte Carlo simulations of lattice gases. 

To construct pseudo-random values, the CMSSL random number generators use 
state tables. The Fast RNG allocates one state table per physical Connection 
Machine node. The VP RNG allocates one state table per array position. The Fast 
RNG thus consumes substantially less memory than the VP RNG. The VP RNG 
can produce identical results on differently sized partitions. 

Either CMSSL RNG may be reinitialized for reproducible results and 
checkpointed to guard against the effects of forced interruption. 
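
A lagged-Fibonacci generator keeps a table of recent values and forms each new value from two entries a fixed distance apart. The Python sketch below uses the additive recurrence x[n] = (x[n-5] + x[n-17]) mod 2^32. The lags (5, 17), the modulus, the seeding scheme, and all names are assumptions of this sketch; the source does not state the parameters used by the CMSSL.

```python
class LaggedFibonacci:
    """Additive lagged-Fibonacci generator x[n] = (x[n-5] + x[n-17]) mod 2**32."""
    SHORT, LONG, MOD = 5, 17, 2 ** 32

    def __init__(self, seed):
        # Fill the state table with a simple linear congruential generator.
        state, table = seed % self.MOD, []
        for _ in range(self.LONG):
            state = (1664525 * state + 1013904223) % self.MOD
            table.append(state)
        table[0] |= 1  # an additive generator needs at least one odd entry
        self.table, self.pos = table, 0

    def next_uint(self):
        # table[pos] holds x[n-17]; x[n-5] sits 12 slots further along the ring.
        t = self.table
        value = (t[self.pos] + t[(self.pos + self.LONG - self.SHORT) % self.LONG]) % self.MOD
        t[self.pos] = value
        self.pos = (self.pos + 1) % self.LONG
        return value

    def next_float(self):
        return self.next_uint() / self.MOD  # uniform in [0, 1)

    def checkpoint(self):
        return list(self.table), self.pos

    def restore(self, saved):
        self.table, self.pos = list(saved[0]), saved[1]


def vp_random(n_positions, seed):
    """One state table per array position, seeded by the position index, so the
    values do not depend on how the positions are divided among nodes."""
    return [LaggedFibonacci(seed + i).next_float() for i in range(n_positions)]
```

Reinitializing with the same seed reproduces the stream, and saving the table and position is all a checkpoint requires. Keeping one table per array position costs more memory than one table per node, which is the trade-off described above.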


12.10 Statistical Analysis 

The CMSSL statistical analysis routines currently include two histogramming 
operations. Histograms provide a statistical mechanism for simplifying data. 
They are generally used in applications that need to display or extract summary 
information, especially in cases when the raw data sets are too large to fit into 
the Connection Machine system. Two routines are provided: one that tallies the 
occurrences of each value in a CM array, and one that counts the occurrences of 
values within specified value ranges. For particularly large data sets, the range 
histogram operation facilitates breaking data down into subranges, perhaps as a 
preliminary step before doing more detailed analysis of interesting areas. 

Histograms have many applications in image analysis and computer vision. For 
example, a technique known as histogram equalization computes a histogram of 
pixel intensity values in an image and uses it to rescale the original picture. 

The CMSSL histogram operations treat the elements of a front-end array as a 
series of bins. In each bin a tally of CM field values or value ranges is stored. The 
number of histogram bins varies widely with the application, from a dozen tallies 
on a large process or a few dozen markers on a probability distribution to a few 
hundred intensity values in an image or a few thousand instruction codes in a 
performance analysis. 
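
The two kinds of histogram, and the histogram equalization example mentioned above, can be sketched in Python as follows. The function names, the equal-width ranges, and the simplified equalization formula are assumptions made for illustration; values passed to histogram are assumed to be integers in the range 0 to nbins-1.

```python
def histogram(values, nbins):
    """Tally the occurrences of each integer value 0 .. nbins-1."""
    bins = [0] * nbins
    for v in values:
        bins[v] += 1
    return bins


def range_histogram(values, lo, hi, nbins):
    """Count the values that fall in nbins equal-width ranges covering [lo, hi)."""
    bins = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            bins[min(int((v - lo) / width), nbins - 1)] += 1
    return bins


def equalize(pixels, levels):
    """Histogram equalization: rescale intensities using the cumulative histogram."""
    hist = histogram(pixels, levels)
    cdf, total = [], 0
    for h in hist:
        total += h
        cdf.append(total)
    return [(cdf[p] * (levels - 1)) // len(pixels) for p in pixels]
```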


12.11 Communication Functions 

The CMSSL includes routines for efficient data motion for nearest-neighbor 
operations on regular grids, for all-to-all communication on segmented arrays, 
and for gather and scatter operations on unstructured grids. The library also 
provides utilities for data distribution for load balancing of communication. 


Polyshift 

Many scientific applications make extensive use of array shifts in more than one 
direction and/or dimension in an array geometry. One well-known example is 
“stencils” used in solving partial differential equations (PDEs) by explicit finite 
difference methods. Similar communication patterns are encountered in other 
applications. For example, in quantum chromodynamics one needs to send 
(3 x n) complex matrices in each direction of a four-dimensional lattice. Multiple 
array shifts are also useful in many molecular dynamics codes. In the 
CMSSL, such multiple array shifts are called “polyshifts” (PSHIFTs). They can 
be recognized in CM Fortran code by a sequence of CSHIFT and/or EOSHIFT calls 
in multiple directions of multiple dimensions, with no data dependencies among 
the arguments and the results of the shifts. There is a potential performance gain 
in recognizing a polyshift communications pattern and calling specially developed 
routines for doing the shifts. In addition, application programs that utilize 
calls to polyshift routines can benefit from enhanced readability and maintainability. 
The CMSSL includes a high-level interface for calling polyshift routines 
from CM Fortran. 
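
The pattern is easy to recognize in a five-point stencil: four circular shifts of the same array, none depending on the result of another. The Python sketch below emulates CSHIFT on a two-dimensional list and groups the four shifts into one call. The functions cshift, polyshift, and laplacian are hypothetical stand-ins written for this illustration; they are not the CMSSL polyshift interface.

```python
def cshift(grid, shift, axis):
    """Circular shift of a 2-D list in the manner of CSHIFT: along the chosen
    axis (0 = rows, 1 = columns), result[i] = grid[(i + shift) mod n]."""
    rows, cols = len(grid), len(grid[0])
    if axis == 0:
        return [list(grid[(i + shift) % rows]) for i in range(rows)]
    return [[row[(j + shift) % cols] for j in range(cols)] for row in grid]


def polyshift(grid, shifts):
    """Perform several independent shifts of the same array in one call.
    shifts is a list of (shift, axis) pairs; one shifted copy is returned per pair."""
    return [cshift(grid, s, a) for s, a in shifts]


def laplacian(grid):
    """Five-point stencil with periodic boundaries, built from one polyshift."""
    north, south, west, east = polyshift(grid, [(-1, 0), (1, 0), (-1, 1), (1, 1)])
    return [[north[i][j] + south[i][j] + west[i][j] + east[i][j] - 4 * grid[i][j]
             for j in range(len(grid[0]))] for i in range(len(grid))]
```

Grouping the shifts in one call is what gives a library the opportunity to schedule all of the communication together.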


All-to-All Broadcast 

All-to-all broadcasting is often used to implement data interactions of the type 
occurring in many so-called N-body computations, in which every particle interacts 
with every other particle. With an array distributed over a number of 
memory modules, each of which is associated with a parallel processing node, 
every module must receive the data from every other module. Another example 
of an application of all-to-all broadcasting is matrix-vector multiplication with 
the matrix distributed with entire rows per processor, and the vector distributed 
evenly over the processors. Every processor must gather all the elements of the 
vector in order to perform the required multiplication. 

The CMSSL supports two versions of all-to-all broadcast. One version is intended 
for applications in which memory requirements are at a premium. For these 
applications, the all-to-all broadcast can be performed in a stepwise manner in 
place. The other version is intended for applications in which the 
all-to-all broadcast can be performed all at once. 
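
The matrix-vector example can be simulated serially in Python, with one list entry standing for each processing node. The first function gathers the whole vector on every node at once; the second passes pieces around a ring so that a node works with only one remote piece at a time. Both functions, their names, and the ring ordering are assumptions of this sketch rather than a description of the CMSSL routines.

```python
def all_to_all_broadcast(pieces):
    """Every node ends up with the concatenation of all nodes' pieces."""
    full = [v for piece in pieces for v in piece]
    return [list(full) for _ in pieces]


def distributed_matvec(row_blocks, vector_pieces):
    """row_blocks[p] holds the complete rows owned by node p; vector_pieces[p]
    is node p's share of the vector. Returns each node's share of the product."""
    gathered = all_to_all_broadcast(vector_pieces)
    return [[sum(a * x for a, x in zip(row, gathered[p])) for row in rows]
            for p, rows in enumerate(row_blocks)]


def stepwise_matvec(row_blocks, vector_pieces):
    """Same product, but pieces travel around a ring one hop per step, so a
    node never stores the whole vector."""
    n_nodes = len(vector_pieces)
    offsets = [sum(len(v) for v in vector_pieces[:q]) for q in range(n_nodes)]
    result = [[0] * len(rows) for rows in row_blocks]
    for step in range(n_nodes):
        for p, rows in enumerate(row_blocks):
            q = (p + step) % n_nodes      # owner of the piece now held by node p
            piece, start = vector_pieces[q], offsets[q]
            for r, row in enumerate(rows):
                result[p][r] += sum(row[start + k] * x for k, x in enumerate(piece))
    return result
```

The stepwise form trades extra communication steps for a smaller memory footprint, which is the distinction drawn between the two versions above.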


All-to-All Reduction 

In all-to-all reduction, reduction operations such as sum, max, and min are performed 
concurrently on different data sets, each of which is distributed over all 
processing nodes; the results of the different reductions are evenly distributed 
over all nodes. In effect, an all-to-all reduction is the reverse operation of a 
broadcast, where sum, max, or min replaces the copy operation. 
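
A small serial simulation in Python may help fix the idea. Here contributions[p][s] is the part of data set s held by node p; each data set is reduced across the nodes and the results are dealt out in equal contiguous blocks. The block distribution, the function name, and the requirement that the number of data sets be a multiple of the number of nodes are assumptions of this sketch.

```python
def all_to_all_reduction(contributions, op=sum):
    """Reduce each data set across all nodes, then spread the results evenly.

    contributions[p][s] is node p's contribution to data set s. Entry p of
    the returned list holds the reduced data sets assigned to node p.
    """
    n_nodes = len(contributions)
    n_sets = len(contributions[0])
    if n_sets % n_nodes:
        raise ValueError("number of data sets must be a multiple of the node count")
    reduced = [op(contributions[p][s] for p in range(n_nodes)) for s in range(n_sets)]
    per_node = n_sets // n_nodes
    return [reduced[p * per_node:(p + 1) * per_node] for p in range(n_nodes)]
```

Passing op=max or op=min selects a different reduction operator without changing the data movement.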


Matrix Transpose 

The matrix transpose routine transposes two axes of a multidimensional array. 
This routine is designed specifically to provide enhanced performance when one 
of the axes to be exchanged is local (resides within a single processing node or 
vector unit) and the other is non-local (spans multiple nodes or vector units). 


Gather and Scatter 

The CMSSL includes several gather and scatter utilities: 

■ Sparse Gather and Scatter Utilities. These communication primitives are 
used by the CMSSL basic linear algebra routines for arbitrary sparse 
matrices. They are intended for applications that do not do explicit sparse 
linear algebra operations, but want to make use of some of the primitives 
commonly used in these operations. The gather utility gathers elements of 
a vector into an array using pointers supplied by the application; the scatter 
utility scatters elements of an array to a vector using pointers supplied 
by the application. Pre-processing is performed by associated setup 
routines. 

■ Enhanced Gather and Scatter Utilities. These utilities are used in conjunction 
with the partitioning routine described in the next section. Because 
the partitioning routine maximizes data locality, the enhanced utilities are 
significantly faster than the original ones. (The pre-processing time 
includes the time used to run the partitioning routine, and can be substantial.) 

■ Block Gather and Scatter Utilities. These routines move a block of data 
elements from a source CM array into a destination CM array. The gather 
or scatter operation occurs along a single, specified serial axis. 

■ Vector Block Gather and Scatter Utilities. These routines are similar to the 
block gather and scatter routines, but each index element represents a vector 
of data items rather than a single data item. 
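
The roles of gather and scatter in sparse linear algebra can be illustrated with a coordinate-format matrix-vector product in Python: gather the needed vector elements through column pointers, multiply, and scatter the products to the result through row pointers. The function names are hypothetical, and the use of addition to combine values that land on the same element is an assumption of this sketch.

```python
def gather(vector, pointers):
    """array[i] = vector[pointers[i]]"""
    return [vector[p] for p in pointers]


def scatter_add(array, pointers, length):
    """vector[pointers[i]] += array[i], starting from a zero vector."""
    vector = [0] * length
    for value, p in zip(array, pointers):
        vector[p] += value
    return vector


def sparse_matvec(rows, cols, vals, x, n_rows):
    """y = A x for A stored as parallel lists of (row, column, value) triples."""
    gathered = gather(x, cols)                       # x[col] for every non-zero
    products = [a * xj for a, xj in zip(vals, gathered)]
    return scatter_add(products, rows, n_rows)       # sum the products by row
```

The pointer lists depend only on the sparsity pattern, so any pre-processing of them can be done once and reused for every product with the same matrix structure.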


Partitioning of an Unstructured Mesh 

The CMSSL provides a routine that allows you to reorder the elements of an 
unstructured mesh to form discrete partitions. Given an array describing an 
unstructured mesh, the routine returns a permutation of the mesh elements, the 
number of resulting partitions, and the number of elements per partition. You can 
use the permutation to reorder the data you supply to the preprocessor of the 
enhanced gather and scatter routines (described above). This strategy minimizes 
the off-vector-unit (or off-processing-node, for machines without vector units) 
communication required by the gather or scatter operation, since each partition 
resides within a vector unit (or processing node). 


Communication Compiler 

The CMSSL communication compiler is a set of routines that compute and use 
message delivery optimizations for basic data motion and combining operations 
(get, send, send with overwrite, and send with combining). The communication 
compiler allows you to compute an optimization (or trace) just once, and then 
use the trace many times in subsequent data motion and combining operations. 
This feature can yield significant time savings in applications that perform the 
same communication operation repeatedly. A variety of methods for computing 
a trace are available. 
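
The compute-once, use-many-times idea can be shown with a deliberately simple Python analogy. Here the "trace" is just a table that groups source indices by destination address; it is built once from the addresses and then reused for any number of send-with-combining operations on different data. The representation of the trace and both function names are assumptions of this sketch and say nothing about how the CMSSL computes or stores a trace.

```python
def compute_trace(destinations):
    """Group source indices by destination address. Done once per pattern."""
    trace = {}
    for src, dst in enumerate(destinations):
        trace.setdefault(dst, []).append(src)
    return trace


def send_with_combining(trace, data, dest_size, combine=sum):
    """Deliver data along a precomputed trace, combining values that collide.
    Destinations that receive nothing are left at 0."""
    out = [0] * dest_size
    for dst, sources in trace.items():
        out[dst] = combine(data[s] for s in sources)
    return out
```

The cost of examining the addresses is paid in compute_trace alone, so repeated operations with the same pattern skip that work.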


Chapter 13 

Data Visualization 


Visualization, the graphic representation of data, has come to be an essential 
component of scientific computing. Visualization techniques range from a simple 
plotting of data points to sophisticated interactive simulations, but all allow 
researchers to analyze the results of their computations visually. One can literally 
“look at” the data to identify special areas of interest, anomalies, or errors that 
may not be apparent when scanning raw numbers. Visualization is often the only 
way to interpret the large data sets and complex problems common to the 
applications run on the Connection Machine system. 


13.1 A Distributed Graphics Strategy 

In keeping with its role as a network resource, the CM-5 uses a distributed graphics 
strategy to support a wide range of user applications. The key items in this 
strategy are 

■ the parallel processing power of the Connection Machine supercomputer 

■ the specialized power and interactive visualization environments, such as 
AVS, provided by dedicated graphics display stations 

■ the use of standard protocols, such as X11, to allow communication among 
a variety of hardware and software 

A full range of interconnections is supported, from high-speed HIPPI interfaces 
through FDDI and Ethernet for longer-distance communications, to allow fast 
communication between the CM and graphics display stations. 

Basically, the pattern is as follows: Computations carried out by the CM’s parallel 
processing nodes manipulate data to create graphics primitives, which can then 
be sent to a graphics display station anywhere on the network. This strategy lets 
users maximize the value of existing hardware and software, while taking advantage 
of the computational speed and power of the CM, the high bandwidth of CM 
I/O, and the rendering power and speed of graphics workstations (such as those 
from Silicon Graphics and Sun), which implement many advanced rendering 
techniques in hardware and offer extensive visualization environments to make 
interactive rendering easy for the user. 

Following this strategy, for example, a scientific visualization program can use 
the CM to compute image geometry (including, for example, polygon coordinates 
and color information) and then send it from the CM directly to local 
memory on the graphics workstation, where the results of simulations done on 
the CM can be displayed and analyzed interactively. 

At the workstation, users benefit from the ability to create and use graphical user 
interfaces (GUIs). GUIs are widely used today and growing in popularity, as their 
use enhances productivity for applications programmers and users alike, allows 
tighter coupling of simulation and visualization, and allows such activities as 
simulation steering. Many tools exist for the creation of such interfaces, and all 
are now available to the CM programmer. 



13.2 An Integrated Environment 

By using the distributed graphics strategy described above, together with an 
underlying protocol such as X11 or an existing GUI, such as AVS, programmers 
can create and use a wide variety of integrated environments for their computational 
and visualization tasks. Connection Machine software provides an 
environment that permits the exchange of very large data sets between the CM 
and framebuffers, workstations, or X window terminals. 

November 1993 

Copyright © 1993 Thinking Machines Corporation 


13.3 The X11 Protocol 

Support for the network-based X graphics protocol is integral to the CM distributed 
graphics strategy, since use of this protocol facilitates both data transfer and 
the use of GUIs, and allows considerable portability: data from a CM can be displayed 
on any X workstation. 

But simple portability is not the only issue involved. As useful as graphics workstations 
are, the extra-large data sets typically used in CM applications frequently 
provide more data than such workstations can readily handle. The solution to this 
problem lies partly in using the CM’s power to reduce the volume of information 
contained in the data sets so that the workstations can handle it rapidly, and partly 
in the successful integration of visualization environments, workstations, and 
high-speed framebuffers into a coherent system for rendering scientific data. 


13.4 The CMX11 Library 

The CMX11 library provides routines that allow the transfer of parallel data 
between the CM and any X11 terminal or workstation. The library is callable 
from CM Fortran and C*. It contains routines that draw text strings, polygons, 
and image-text strings; draw and fill points, lines, rectangles, and arcs; and draw 
and get images. The CMX11 library thus extends the X11 libraries by providing 
parallel network calls that use parallel variables instead of serial arrays. For 
example, where the X library offers an XDrawPoint routine, the CMX library 
offers CMXDrawPoint: 

    CMXDrawPoint(Display *display, Drawable *d, 
                 GC gc, int x, int y) 

where x and y are pointers to parallel variables, and all other arguments are 
identical to the serial call. 

Similarly, the CMX version of the X11 XPutImage routine uses the arguments 
and semantics of the original to provide a parallel transfer of an image that exists 
as a parallel array: 

    CMXPutImage(display, d, gc, data, depth, 
                src_x, src_y, dest_x, dest_y, 
                width, height) 

Note that no X protocol extensions are necessary, since the underlying CM socket 
mechanism makes the data source entirely transparent to the server. In most 
cases, the user simply makes the parallel version of the normal call, and the parallel 
data is inserted into the data stream in the same format and position as it 
would have been in the equivalent serial call. This greatly facilitates the user’s 
task. 


13.5 Visualization Environments — CM/AVS 

CM/AVS is the first of the GUIs available on the CM-5. Other GUIs are expected 
to be adapted to the CM-5 in the future. 

CM/AVS adapts and extends the Application Visualization System (AVS, from 
Advanced Visualization Systems, Inc.) to the realm of the CM-5. AVS provides 
a rich graphic programming environment in which a user builds a distributed 
visualization application. An application may involve diverse operations such as 
filtering, graphing, volume rendering, polygon rendering, image processing, and 
animation. 

CM/AVS enables an application to operate on data that is distributed on CM-5 
processing nodes and to interoperate with data from other sources. A user runs 
AVS normally on a local workstation and uses the modules and functions that 
CM/AVS provides to process data on the CM-5. That way, the advantages of user- 
interface-intensive workstation visualization are combined with the power of 
data-intensive CM-5 applications. 

The building blocks of an AVS application program are small, packaged units of 
code, called modules. Most modules process a set of inputs into a set of outputs. 
They provide functions such as volume rendering, isosurfacing, image processing, 
polygon rendering, fluid-flow visualization, graphing, statistical analysis, 
file I/O, and many others. Hundreds of visualization modules are available from 
AVS and Thinking Machines and in the public domain. 

Modules are connected to form larger applications, called networks. In a network, 
information is passed between the modules using a small number of 
standard data types such as arrays and geometric objects. The small number of 
data types allows a wide variety of modules to be interconnected, allowing rich, 
custom environments to be created. 

CM/AVS provides a parallel version of the AVS “field” data type. AVS fields are 
used to represent arbitrary arrays of data. CM/AVS’s parallel field data is allocated 
on the CM-5 processing nodes as CM Fortran arrays or C* parallel 
variables. 

In the AVS network, parallel fields appear identical to regular serial fields; the 
two may be used interchangeably. When CM/AVS modules that operate on parallel 
data are connected with AVS modules that operate on serial data, CM/AVS 
routines convert the data between parallel and serial fields as required. The conversion 
is transparent to the user and to the module writer. 


Chapter 14 

CMMD 


The Connection Machine communication library, CMMD, is an advanced, 
user-extensible communication library. 

■ For portability, the CMMD library provides traditional message-passing 
functions that facilitate the porting of MIMD-style codes to the CM-5 and 
the design and creation, on the CM-5, of applications intended for execution 
on a variety of machines. 

■ For users designing applications specifically for the CM-5, it offers a wide 
variety of communication functions that supplement the CM-5’s data parallel 
software, allowing the use of multiple programming models and 
techniques. 

■ For users with an experimental bent, it offers primitives with which they 
can design and construct new communication protocols. 

■ For all users, it offers great versatility. Based on active messages, CMMD 
provides synchronous and asynchronous functions, heavy-weight and 
light-weight functions, and global operations, allowing users to make 
the most effective use of CM communications for their applications’ 
performance and efficiency. 

Applications using CMMD take advantage of the level of programming control 
offered by node-level programming. This type of programming is particularly 
well suited to applications that demand dynamic allocation of tasks or data 
among processors. 


14.1 Node-Level Programming with CMMD 

In node-level programming, a single program executes independently on each 
node; the nodes communicate only through explicit communication functions. 
CMMD provides both synchronous and asynchronous communication functions; 
it thus provides interprocessor communication that falls outside the range 
of the data parallel languages. 

CMMD permits concurrent processing in which synchronization occurs only 

■ between matched sending and receiving nodes, during the execution of 
cooperative communication functions 

■ among all nodes, during the execution of global communication or I/O 
functions 

■ when explicitly requested by a CMMD synchronization function 

At all other times, computing on each node proceeds asynchronously. 

CMMD can be called from applications written in Fortran 77, CM Fortran, C, 
C++, and C*. It handles both serial and parallel data. 

In addition, CMMD offers both serial and parallel I/O facilities, based on UNIX 
I/O. In serial I/O, each node reads and writes files independently. (Opening and 
closing files may be done either independently or cooperatively.) In parallel I/O, 
the nodes cooperate to open, close, read, and write files: reading a file may send 
the same data to all nodes, or may distribute data across all nodes. 


14.2 Programming Models 

CMMD supports a variety of programming models. 


14.2.1 Hostless and Host-Node Models 

CMMD supports both the host/node programming model and the hostless programming 
model. In host/node programming, the user writes two programs: one 
runs on the host (a CM-5 partition manager), while independent copies of the 
node program run on each of the processing nodes. The host may have little 
involvement aside from initially invoking the node program and providing user 
interface services. In the hostless programming model, the user writes a single 
program, independent copies of which run on every node. The host, meanwhile, 
acts as a server (usually an I/O server), executing a CMMD-supplied server program. 
(Users may customize this program, if they choose to do so.) 

These two generic models allow the use of a wide variety of more specific programming 
models, from the highly asynchronous (such as many master-worker 
programs and programs using tree-based algorithms) to the highly synchronized. 


14.2.2 The Global-Local Model 

CMMD also supports the global-local programming model. In this model, the 
main program is written in CM Fortran and executes globally, laying out parallel 
arrays across all processing nodes. The global program calls user-written “local 
routines,” written either in CM Fortran or in C plus DPEAC. The local routines 
perform node-level computations and use CMMD for inter-node communications. 
(In essence, they are hostless CMMD programs.) When they finish, they 
return control to the global program. 

During the execution of local routines, each node operates on its own subgrid of 
a global parallel array as if that subgrid were a complete nodal array. Informational 
functions allow the node to locate the position of its “subarray” within the 
global array. Local routines can also create arrays, common blocks, etc., of their 
own. These are completely local and are accessible only within the scope of the 
local routines. 

The global-local model is thus useful for applications that can benefit from both 
global data parallel and local data parallel and control parallel techniques. It is 
one more example of the flexibility that marks Connection Machine programming. 


14.3 Message-Passing Protocols 

Message-passing protocols coordinate and control communications between two 
processors, operating either synchronously or asynchronously. Message passing 
can occur in cases in which each processor has no a priori knowledge of the data 
layout on other processors and cannot read or write remote memory locations 
without “permission” from the remote node. It can also occur in more well-defined 
environments, where communication patterns are relatively static or 
where nodes have knowledge of remote memory layouts. 

CMMD offers both functions that can handle the general case, where the processors 
have no such knowledge, and functions that take advantage of well-defined 
situations. In all, the library offers four classes of functions: 

■ Point-to-point functions, which use an initial handshake (RTS/ACK) protocol 
to coordinate subsequent data transmissions. 

■ Functions that create and use long-lived virtual communication channels 
to support low-latency operations on messages of all sizes. 

■ Functions that support active messages and active message handlers; these 
provide an easily extensible mechanism for invoking functions on remote 
nodes via the transmission of an interrupting Data Network packet. 

■ Functions that use “receive ports” for low-latency transfer of arrays from 
one node to another. 


14.3.1 Point-to-Point Functions 

Point-to-point message-passing functions require coordination only between the 
sender and receiver. Both blocking and non-blocking functions are supplied. 
Blocking functions are “cooperative”: they do not complete until the 
corresponding function has been called on the destination processor. (If the 
second processor is not ready, the first processor waits for it.) Thus, they provide 
synchronization as well as communication. Non-blocking functions are 
asynchronous: they return as soon as the processor has announced its readiness 
either to send or to receive. The processor can then perform other work, while 
waiting for the destination processor to announce its readiness. When both 
processors have signaled readiness, they receive interrupt messages telling them 
to begin transmission. 

Point-to-point data transfers are performed in memory order. A transfer continues
until either the entire source array is sent or the destination array is
filled up.

On CM-5 systems equipped with vector units, CMMD point-to-point functions 
can transfer data from either microprocessor memory or vector unit memory. 
Consider this as the act of transmitting arrays of data from one processor to
another. The functions operate on three types of arrays: 

■ Serial arrays. Stored in microprocessor memory, serial arrays are
specified simply by a starting address plus a length. (These are very 
standard-appearing function calls, in which the programmer merely 
supplies the array’s name plus a “buffer descriptor” argument that gives 
the array length, in bytes.) 

■ Strided serial arrays. Like serial arrays, these are stored in microprocessor
memory. They are defined by a starting address, an element size, a stride,
and an element count (specifying the number of elements to be transferred).

The stride is the distance (in bytes) between the starting point of one element
and the starting point of the next element.

■ Parallel arrays. Parallel arrays are CM Fortran or C* arrays, and are
defined by the usual CMRT array descriptor data structures for such arrays. 
These reside in the memory of the vector units, usually spanning the 
memories of all four VUs. (The CMMD program does not specify this data 
structure; it merely supplies the array’s name plus a “buffer descriptor” 
argument indicating that the array is parallel rather than serial.) 

Serial transfers can mix serial arrays and strided serial arrays, to provide scatter- 
gather behavior. (See Figure 25.) Parallel transfers always transfer data between 
identically laid-out arrays. Programmers may use CM Fortran or C* functions to 
reshape arrays, or to change serial to parallel arrays, or vice versa, if they wish 
to send data between parallel and serial arrays, or between unlike parallel arrays. 


[Figure 25 panels, each pairing a send call with a receive call:
send 4 bytes / receive 4 bytes;
send_v 4 elements (stride = 2, elem_len = 1) / receive_v 4 elements (stride = 2, elem_len = 1);
send 4 bytes / receive_v 4 elements (stride = 2, elem_len = 1);
send_v 4 elements (stride = 2, elem_len = 1) / receive 4 bytes.]

Figure 25. Sending and receiving data.
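
The scatter-gather behavior of mixed serial and strided transfers can be sketched in a few lines of Python. This is an illustration only, not CMMD code: the helper names gather and scatter are hypothetical, and a byte string stands in for node memory. The sketch mirrors two of the pairings in Figure 25, using a stride of 2 bytes and an element length of 1 byte.

```python
def gather(buf, start, elem_len, stride, count):
    """Collect `count` elements of `elem_len` bytes, spaced `stride` bytes apart."""
    out = bytearray()
    for i in range(count):
        pos = start + i * stride
        out += buf[pos:pos + elem_len]
    return bytes(out)


def scatter(buf, start, elem_len, stride, data):
    """Write consecutive elements of `data` into `buf`, `stride` bytes apart.

    Stops when the data runs out or the next element would not fit,
    echoing the rule that a transfer ends when the source is exhausted
    or the destination is full.
    """
    for i in range(len(data) // elem_len):
        pos = start + i * stride
        if pos + elem_len > len(buf):
            break
        buf[pos:pos + elem_len] = data[i * elem_len:(i + 1) * elem_len]
    return buf


# Strided send (4 elements, stride 2, elem_len 1) into a plain 4-byte receive.
packed = gather(b"ABCDEFGH", 0, 1, 2, 4)

# Plain 4-byte send into a strided receive (4 elements, stride 2, elem_len 1).
dest = scatter(bytearray(b"........"), 0, 1, 2, b"WXYZ")

print(packed)       # b'ACEG'
print(bytes(dest))  # b'W.X.Y.Z.'
```

The first call gathers every other byte into a contiguous buffer; the second scatters a contiguous buffer into every other byte of the destination.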


Global Operations 

The CMMD library also provides a number of global functions that operate under
the same general protocols as the point-to-point functions. Global functions perform
their operations over all the nodes, and require participation from all nodes
in the call. Some include and some exclude the host. Some perform a single 
operation and return their result as a return value; others operate on vectors of 
data and write their results into destination buffers. 

Global functions include 

■ broadcasting data or instructions from the host or from one node to all 
nodes 

■ reducing data from all nodes to all nodes or to the host 

■ performing scans (parallel prefix operations) across the nodes 

■ performing segmented parallel prefix operations 

■ concatenating elements into a buffer on all nodes, or into a buffer on the
host.
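
What these global operations compute can be modeled by treating a Python list as the values held by the nodes, one entry per node. The function names below are hypothetical and are not the CMMD entry points; the sketch shows only the result each operation delivers, not how the networks carry it out.

```python
from itertools import accumulate
import operator


def broadcast(value, n_nodes):
    """Every node receives a copy of one value."""
    return [value] * n_nodes


def reduce_to_all(values, op):
    """Combine one value per node; every node receives the single result."""
    total = values[0]
    for v in values[1:]:
        total = op(total, v)
    return [total] * len(values)


def scan(values, op):
    """Parallel prefix: node j receives the combination of values 0..j."""
    return list(accumulate(values, op))


def segmented_scan(values, seg_start, op):
    """Parallel prefix that restarts wherever seg_start is true."""
    out = []
    for v, start in zip(values, seg_start):
        out.append(v if start or not out else op(out[-1], v))
    return out


def concat_to_all(buffers):
    """Concatenate per-node buffers in node order; every node gets the result."""
    joined = [x for buf in buffers for x in buf]
    return [list(joined) for _ in buffers]


node_values = [3, 1, 4, 1, 5]
print(reduce_to_all(node_values, operator.add))                    # [14, 14, 14, 14, 14]
print(scan(node_values, operator.add))                             # [3, 4, 8, 9, 14]
print(segmented_scan(node_values, [1, 0, 0, 1, 0], operator.add))  # [3, 4, 8, 1, 6]
```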


14.3.2 Virtual Channels 

Communication patterns frequently remain constant over time. CMMD provides
functions for opening, closing, and transmitting data via software communication
“channels” for these cases. These channels are uni-directional connections
between specific processors. When one is opened, the two processors that 
opened it exchange information about the relative array shapes at each end of the 
channel. 

The channel remains active through as many uses as desired, allowing data to be 
transmitted without incurring the handshake overhead associated with traditional 
message-passing systems. By thus amortizing the overhead of establishing a 
channel over multiple uses, these functions allow programmers to efficiently
operate on static communication patterns and to send both small and large 
amounts of data repeatedly between nodes with low latency. 

Channels provide an implicit ordering or sequencing to transactions. Thus, they 
guarantee that messages are received in the order in which they are sent. 


14.3.3 Active Messages

Drawing on results of the TAM project at the University of California at 
Berkeley, active messages provide an easily extensible mechanism for invoking 
functions on remote nodes via the transmission of an interrupting Data Network 
packet. 

The format of an active message consists of one word containing the address of 
a “handler function” to be invoked on the destination processor, followed by n 
words (with n defined in the function call itself) making up the argument list to 
be passed to that function. 

Upon receipt of an active message, the destination processor invokes the specified
“handler function,” passing to it as arguments the contents of the message.
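
The message format and dispatch step can be illustrated with a small simulation. Everything here is a stand-in chosen for the sketch: the Node class, its register, deliver, and service methods, and the remote_store handler are hypothetical, an index into a handler table plays the role of the handler address, and an explicit service call replaces the interrupt raised by an arriving packet.

```python
class Node:
    """A simulated destination node (hypothetical; not a CMMD structure)."""

    def __init__(self):
        self.memory = {}
        self.inbox = []      # packets delivered by the simulated Data Network
        self.handlers = []   # a list index stands in for a code address

    def register(self, fn):
        """Record a handler and return its stand-in 'address'."""
        self.handlers.append(fn)
        return len(self.handlers) - 1

    def deliver(self, packet):
        """Accept a packet: first word is the handler address, the rest are arguments."""
        self.inbox.append(packet)

    def service(self):
        """Invoke the named handler on each packet's argument words, in arrival order."""
        while self.inbox:
            handler_addr, *args = self.inbox.pop(0)
            self.handlers[handler_addr](self, *args)


def remote_store(node, address, value):
    """Example handler: store a value into the destination node's memory."""
    node.memory[address] = value


dest = Node()
STORE = dest.register(remote_store)
dest.deliver([STORE, 0x40, 99])   # one handler word followed by n = 2 argument words
dest.service()
print(dest.memory)                # {64: 99}
```

New communication functions are added in this model simply by registering further handlers, which is the sense in which the mechanism is extensible.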

CMMD makes use of active messages to perform protocol-processing functions. 
In addition, it provides a set of primitives that allow users to create their own 
communication functions according to their own application requirements. 


14.3.4 CMAML Array Transfers 

In addition to active messages themselves, CMAML (the CMMD Active Message
Layer) also supports the transfer of blocks of data (usually arrays) from
node to node. Either serial or parallel data can be transferred from the source 
node into a receive port on the destination node. When the transport operation 
completes, the source and/or destination node may invoke an attached handler 
function. 

Users familiar with the CMAML machinery will find it easy to construct a variety
of additional communication and memory exchange functions, such as
remote memory store and fetch, send-to-queue, store-with-op, and compare-and- 
swap. 

As with most transport layers, CMAML functions assume that some higher layer 
of software is providing any needed protocol, and that (for example) receiving 
nodes know what to do with any data sent to them. Users must ensure that their 
applications provide such protocol. 


14.4 I/O

CMMD provides the node-level programmer with I/O routines for opening, reading,
and writing files using both standard UNIX I/O semantics and extensions that
provide for parallel I/O. Support for UNIX functions is provided by linking an I/O
server routine into the host program. A standard host program including the
server is provided by the cmmd-ld link utility for users not wishing to write a
host program.

Parallel extensions provided by CMMD apply to those UNIX I/O functions that
take a file descriptor or stream as an argument, such as read(), write(), etc.
These extensions coordinate the actions of nodes acting in concert so that the
potential of CM-5 parallel I/O devices may be realized. UNIX calls that do not
take a file descriptor, such as mkdir(), chdir(), chmod(), etc., do not synchronize
and may be called by any node.


14.4.1 Support for Extra-Large Files 

Since UNIX uses only a 32-bit integer file pointer, seek operations on very large 
files are limited. CMMD therefore supplements the standard UNIX operations 
with functions that take an argument of type double, indicating a position within 
an extra-large file. 
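
The motivation can be checked with a little arithmetic. The sketch below assumes that the 32-bit file pointer is signed, so that the largest offset is 2^31 - 1 bytes, and that type double is an IEEE 754 64-bit value, which represents every integer up to 2^53 exactly. Neither assumption is stated in this document, and the helper names are hypothetical.

```python
INT32_MAX = 2**31 - 1        # largest offset in a signed 32-bit pointer (assumed signed)
DOUBLE_EXACT_MAX = 2**53     # every integer up to this value is exact in an IEEE double


def fits_int32(offset):
    """True if a byte offset can be held in a signed 32-bit file pointer."""
    return 0 <= offset <= INT32_MAX


def exact_as_double(offset):
    """True if converting the offset to a double and back loses nothing."""
    return int(float(offset)) == offset


ten_gbytes = 10 * 2**30
print(fits_int32(ten_gbytes))                 # False
print(exact_as_double(ten_gbytes))            # True
print(exact_as_double(DOUBLE_EXACT_MAX + 1))  # False
```

Under these assumptions a double-valued position addresses files far beyond the roughly 2-Gbyte reach of the 32-bit pointer while still naming every byte exactly.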


14.4.2 I/O Modes 

In order to support UNIX functionality while simultaneously extending it to support
parallel operations, CMMD creates new file descriptors and introduces new
I/O modes.

The I/O modes defined by CMMD are as follows (see Figure 26): 

■ local independent 

■ global independent 

■ global synchronous broadcast 

■ global synchronous sequential 

Figure 26. Four patterns for reading a file.


Local Mode

Local mode supports completely independent I/O operations. In this mode, 
individual nodes can read and write different files independently, without having 
to coordinate activities with other nodes. Nodes maintain their own local file 
descriptors and file pointers. 


Global Independent Mode 

Global independent mode allows all nodes to access a single file for independent 
reading and writing. Files opened in this mode create only a single entry in the 
process’s table of file descriptors. Every node, however, maintains its own 
pointer to the file, and can move that pointer about at will, reading or writing the 
file in an independent manner. 

Local and global independent modes are used primarily to read ordinary files on 
UNIX file servers. The major reasons for choosing global independent mode over 
local mode are, first, the conservation of file descriptors, and second, the ability 
to change the mode of global independent files to one of the global synchronous 
modes, thereby achieving high-performance I/O. 


Global Synchronous Broadcast Mode 

Global synchronous broadcast mode allows nodes to simultaneously read the 
same data from a file or stream. Data, as it is read in, is broadcast to all the nodes. 
This is particularly useful for having all nodes read the same input from the 
user’s terminal or read in the same header information from a file. On output, 
global broadcast mode acts as if only processor 0 contributed data. A file or 
stream in synchronous broadcast mode must be written or read by all nodes in 
the partition.


Global Synchronous Sequential Mode 

Global synchronous sequential mode is similar to global synchronous broadcast
mode, except that in the sequential case, data is distributed across the nodes 
instead of broadcast. Again, all nodes must participate in the I/O call. 

For reads, each node issues a request for a buffer of a specified length. The data 
is read into the nodes as if node 0 first read, with all successive nodes following. 
For writes, each node contributes a buffer to be written, and the data is written 
as if node 0 first wrote its buffer, with each successive node’s data immediately 
following. Note that the amount of data read or written may be different for each
node. Nodes not having any data to contribute may set their buffer length to be 
zero. 
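
The "as if node 0 went first" rule amounts to giving each node a file offset equal to the sum of the buffer lengths of all lower-numbered nodes. The sketch below models this with Python byte strings; the function names are hypothetical and the file is simply a bytes object, so it illustrates only the resulting data layout.

```python
def sequential_offsets(lengths, start=0):
    """Offset of each node's piece: the running sum of the earlier lengths."""
    offsets = []
    pos = start
    for n in lengths:
        offsets.append(pos)
        pos += n
    return offsets


def sequential_read(data, lengths):
    """Each node asks for lengths[i] bytes; node i receives the i-th slice."""
    offsets = sequential_offsets(lengths)
    return [data[o:o + n] for o, n in zip(offsets, lengths)]


def sequential_write(buffers):
    """Each node contributes a buffer; the file receives them in node order."""
    return b"".join(buffers)


lengths = [3, 0, 2, 4]                 # node 1 has nothing to contribute
print(sequential_offsets(lengths))     # [0, 3, 3, 5]
pieces = sequential_read(b"abcdefghi", lengths)
print(pieces)                          # [b'abc', b'', b'de', b'fghi']
print(sequential_write(pieces))        # b'abcdefghi'
```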


Establishing and Changing Modes 

Each file descriptor, when created, starts out in one of these modes. The mode 
can be changed by CMMD function calls as the program progresses. File descriptors
for local files are created when an individual node opens a file; file
descriptors for global files are created when all nodes open a file synchronously. 

All UNIX functions that take a file descriptor or stream as an argument are sensitive
to the associated I/O mode. The major difference concerns synchronization
across nodes. In either of the independent modes, operations on file descriptors
proceed independently. In either of the synchronous modes, operations on 
file descriptors synchronize across all nodes in the process of performing the 
operation. 


A Connection Machine CM-5 system contains thousands of computational processing
nodes, one or more control processors, and I/O units that support mass
storage, graphic display devices, and VME and HIPPI peripherals. These are connected
by the Control Network and the Data Network. (For a high-level sketch
of these components, see Figure 27.) 


15.1 Processors 

Every processing node is a general-purpose computer that can fetch and interpret 
its own instruction stream, execute arithmetic and logical instructions, calculate 
memory addresses, and perform interprocessor communication. The processing 
nodes in a CM-5 system can perform independent tasks or collaborate on a single 
problem. Each processing node has 8, 16, or 32 Mbytes of memory; with the 
high-performance arithmetic accelerator, it has 32 or 128 Mbytes of memory and 
delivers up to 160 Mops or 160 Mflops. 

The control processors are responsible for administrative actions such as scheduling
user tasks, allocating resources, servicing I/O requests, accounting,
enforcing security, and diagnosing component failures. In addition, they may 
also execute some of the code for a user program. Control processors have the 
same general capabilities as processing nodes but are specialized for performing 
managerial functions rather than computational functions. For example, control 
processors have additional I/O connections and lack the high-performance arithmetic
accelerator. (See Figure 28.)

In a small system, one control processor may play a number of roles. In larger
systems, individual control processors are often dedicated to particular tasks and
referred to by names that reflect those tasks. Thus, a control processor that manages
a partition and initiates execution of applications on that partition is referred
to as a partition manager (PM), while a processor that controls an I/O device is 
called an I/O control processor (IOCP). 



Figure 27. System components. 

A CM-5 system contains tens, hundreds, or thousands of processing nodes, each with up to 
160 Mflops of 64-bit floating-point performance. It also contains a number of I/O devices
(disk storage nodes or tape storage nodes) and external connections (such as FDDI or
HIPPI). The number of I/O devices and external connections is independent of the number
of processing nodes. Both processing and I/O resources are managed by a relatively small 
set of control processors. All these components are uniformly integrated into the system by 
two internal communications networks, the Control Network and the Data Network. The 
Control Network provides multiway operations that can coordinate thousands of participants,
while the Data Network supports high-bandwidth bulk data transfers. The capacity
of each network scales up with the size of the system; every processing node or I/O device 
gets the network capacity it needs. 


15.2 Networks and I/O

The Control Network provides tightly coupled communications services. It is 
optimized for fast response (low latency). Its functions include synchronizing the 
processing nodes, broadcasting a single value to every node, combining a value 
from every node to produce a single result, and computing certain parallel prefix 
operations. 


[Figure 28 diagram labels: Control Network, Data Network, Standard Computer, LAN Connection.]

Figure 28. Control processor.

The basic CM-5 control processor consists of a RISC microprocessor, memory subsystem,
I/O (including local disks and Ethernet connections), and a CM-5 Network Interface, all
connected to a standard 64-bit bus. Except for the Network Interface, this is a standard
off-the-shelf workstation-class computer system. The Network Interface connects the control
processor to the rest of the system through the Control Network and Data Network.
Each control processor runs CMOST, a UNIX-based operating system with extensions for
managing the parallel-processing resources of the CM-5. Some control processors are used
to manage computational resources and some are used to manage I/O resources.

The Data Network provides loosely coupled communications services. It is optimized
for high bandwidth and excellent price/performance at any machine size.
Its basic function is to provide point-to-point data delivery for tens of thousands
of items simultaneously. Special cases of this functionality include nearest- 
neighbor communication and FFT butterflies. Communications requests and data 
delivery need not be synchronized. Once the Data Network has accepted a message,
it takes on all responsibility for its eventual delivery; the sending processor
can then perform other computations while the message is in transit. Recipients 
may poll for messages or be notified by interrupt on arrival. The Data Network 
also transmits data between the processing nodes and I/O units. 

A standard Network Interface (NI) connects each node or control processor to the 
Control Network and Data Network. This is a memory-mapped control unit; 
reading or writing particular memory addresses will access network control registers
or trigger communication operations.

The I/O units are connected to the Control Network and Data Network in exactly 
the same way as the processors, using the same Network Interface. Many I/O 
devices require more data bandwidth than a single NI can provide; in such cases 
multiple NI units are ganged. For example, a CM5-HIPPI channel interface contains
6 NI units, which provide access to 6 Data Network ports, covering 24
network addresses. (At 20 Mbytes/sec apiece, 6 NI units provide enough bandwidth
for a 100 Mbyte/sec HIPPI interface with some to spare.)
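
The arithmetic behind the parenthetical remark is short enough to spell out. The figures below are taken from the paragraph above; the variable names are chosen for this sketch only.

```python
NI_UNITS = 6                  # NI units ganged in a CM5-HIPPI channel interface
MBYTES_PER_SEC_PER_NI = 20    # bandwidth of one NI unit
HIPPI_MBYTES_PER_SEC = 100    # bandwidth required by the HIPPI interface
NETWORK_ADDRESSES = 24        # addresses covered by the 6 Data Network ports

aggregate = NI_UNITS * MBYTES_PER_SEC_PER_NI          # total ganged bandwidth
headroom = aggregate - HIPPI_MBYTES_PER_SEC           # the "some to spare"
addresses_per_port = NETWORK_ADDRESSES // NI_UNITS    # addresses behind each port

print(aggregate, headroom, addresses_per_port)        # 120 20 4
```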

Individual I/O devices are controlled by dedicated I/O control processors (IOCP). 
Some I/O devices are interfaces to external buses or networks; these include 
interfaces to VME buses and HIPPI channels. Noteworthy features of the I/O 
architecture are that I/O and computation can proceed independently and in parallel,
that data may be transferred between I/O devices without involving the
processing nodes, and that the number of I/O devices may be increased completely
independently of the number of processing nodes.

Hiding in the background is a third network, the Diagnostic Network. It can be 
used to isolate any hardware component and to test both the component itself and 
all connections to other components. The Diagnostic Network pervades the hardware
system but is completely invisible to the user; indeed, it is invisible to most
of the control processors. A small number of the control processors include command
interfaces for the Diagnostic Network; at any given time, one of these
control processors provides the System Console function. 


15.3 Further Information

The following chapters discuss the CM-5 architecture in more detail. 

Chapter 16 contains a sketch of the user-level virtual machine, the programming 
model that is visible to a single user job. This virtual machine is supported by a 
combination of hardware, operating system, and run-time libraries. 

In Chapter 17, local architecture is considered: the structure of individual processors
and associated memory. This is the view seen from any single processor
in the system; it is the level of architecture where program code is executed. 

Chapter 18 discusses global architecture. This specifies how various components
of the system operate together to solve a single problem. This level of
architectural specification provides a framework for understanding the flow of 
control and the management of data in a massively parallel application. 

Chapter 19 describes the system architecture, which addresses support of multiple
user jobs, communication between jobs, I/O transfers, fault diagnosis and
repair, and system administration. 

Chapter 20 presents the I/O architecture, including the design of individual I/O 
devices and how they fit into the system structure. 


The virtual machine provided by the hardware and operating system to a single
user task consists of a control processor acting as a partition manager (PM), a set 
of processing nodes, and facilities for interprocessor communication. Each node 
is an ordinary general-purpose microprocessor capable of executing code written 
in C, Fortran, or assembly language. The processing nodes may also have 
optional vector units for high arithmetic performance. 

The operating system is CMOST, a version of SunOS enhanced to manage CM-5 
processor, I/O, and network resources. The PM provides full UNIX services 
through standard UNIX system calls. Each processing node provides a limited set 
of UNIX services. 

A user task consists of a standard UNIX process running on the PM and a process 
running on each of the processing nodes. Under timesharing, all processors are 
scheduled en masse, so that all are processing the same user task at the same 
time. Each process of the user task, whether on the PM or on a processing node,
may execute completely independently of the rest during their common time 
slice. 

The Control Network and Data Network allow the various processes to synchronize
and transfer data among themselves. The unprivileged control registers of
the Network Interface hardware are mapped into the memory space of each user 
process, so that user programs on the various processors may communicate without
incurring any operating system overhead.


16.1 Communications Facilities

Each process of a user task can read and write messages directly to the Control 
Network and the Data Network. The network used depends on the task to be 
performed. 

The Control Network (CN) is responsible for communications patterns in which 
many processors may be involved in the processing of each datum. One example 
is broadcasting, where one processor provides a value and all other processors 
receive a copy. Another is reduction, where every processor provides a value and 
all values are combined to produce a single result. Values may be combined by
summing them, finding the maximum input value, or taking the logical OR or
exclusive OR of all input values; the combined result may be delivered to a single
processor or to all processors. (Software provides minimum-value and logical
AND operations by inverting the inputs, applying the hardware maximum-value 
or logical OR operation, then inverting the result.) Note that the control processor 
does not play a privileged role in these operations; a value may be broadcast 
from, or received by, the control processor or any processing node with equal 
facility. 
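
The inversion trick for minimum and logical AND can be verified directly. The sketch below assumes that "inverting" means bitwise complement on two's-complement integers (complement reverses numeric order, and De Morgan's law relates AND to OR). The functions hw_max and hw_or merely stand in for the hardware combiners, and all names are hypothetical.

```python
from functools import reduce


def hw_max(values):
    """Stand-in for the Control Network's maximum-value combiner."""
    return max(values)


def hw_or(values):
    """Stand-in for the Control Network's logical OR combiner."""
    return reduce(lambda a, b: a | b, values)


def sw_min(values):
    """Minimum via maximum: complement reverses order, since ~x == -x - 1."""
    return ~hw_max([~v for v in values])


def sw_and(values):
    """AND via OR, by De Morgan's law: a & b == ~(~a | ~b)."""
    return ~hw_or([~v for v in values])


print(sw_min([7, -3, 12, 0]))                   # -3
print(bin(sw_and([0b1110, 0b0111, 0b0110])))    # 0b110
```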

The Control Network contains integer and logical arithmetic hardware for 
carrying out reduction operations. This hardware is distinct from the arithmetic 
hardware of the processing nodes; CN operations may be overlapped with 
arithmetic processing by the processors themselves. The arithmetic hardware of 
the Control Network can also compute various forms of parallel prefix 
operations, where every processor provides a value and receives a result; the nth 
result is produced by combining the first n input values. Segmented parallel 
prefix operations are also supported in hardware. 

The Control Network provides a form of two-phase barrier synchronization (also 
known as “fuzzy” or “soft” barriers). A processor can indicate to the Control 
Network that it is ready to enter the barrier. When all processors have checked
in, the Control Network relays this fact to all processors. A processor can thus
overlap unrelated processing with the possible waiting period between the time
it has checked in and the time it has been determined that all processors have
checked in. This allows thousands of processors to guarantee the ordering of
certain of their operations without ever requiring that they all be exactly synchronized
at one given instant.
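
The two phases can be imitated in software with threads. The sketch below is only an analogy for the hardware mechanism: the TwoPhaseBarrier class, its check_in and wait methods, and run_workers are hypothetical names, and Python threads stand in for processors.

```python
import threading


class TwoPhaseBarrier:
    """Software analogy of a two-phase ("fuzzy") barrier for n participants."""

    def __init__(self, n):
        self._n = n
        self._count = 0
        self._generation = 0
        self._cond = threading.Condition()

    def check_in(self):
        """Phase 1: announce readiness and return at once with a ticket."""
        with self._cond:
            ticket = self._generation
            self._count += 1
            if self._count == self._n:       # last arrival completes the barrier
                self._count = 0
                self._generation += 1
                self._cond.notify_all()
            return ticket

    def wait(self, ticket):
        """Phase 2: block until every participant has checked in."""
        with self._cond:
            self._cond.wait_for(lambda: self._generation > ticket)


def run_workers(n=4):
    barrier = TwoPhaseBarrier(n)
    log, log_lock = [], threading.Lock()

    def worker(i):
        with log_lock:
            log.append(("before", i))
        ticket = barrier.check_in()
        _ = sum(range(1000 * (i + 1)))       # unrelated work overlaps the wait
        barrier.wait(ticket)
        with log_lock:
            log.append(("after", i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the log returned by run_workers, every "before" entry precedes every "after" entry, although the workers never pause at the same instant: the ordering guarantee comes from the barrier, not from lockstep execution.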

The Data Network is responsible for reliable, deadlock-free point-to-point 
transmission of tens of thousands of messages at once. Neither the senders nor 
the receivers of messages need be globally synchronized. At any time, any 
processor may send a message to any processor in the user task. This is done by 
writing first the destination processor number, and then the data to be sent, to
control registers in the Network Interface (NI). Once the Data Network has 
accepted the message, it assumes all responsibility for eventual delivery of the 
message to its destination. In order for a message to be delivered, the processor 
to which it was sent must accept the message from the Data Network. However, 
processor resources are not required for forwarding messages. The operation of 
the Data Network is independent of the processing nodes, which may carry out 
unrelated computations while messages are in transit. 

There is no separate interface for special patterns of point-to-point communication,
such as nearest neighbors within a grid. The Data Network presents a
uniform interface to the software. The hardware implementation, however, has
been tuned to exploit the locality found in commonly used communication patterns.

There are two mechanisms for notifying a receiver that a message is available. 
The arrival of a message sets a status flag in a Network Interface control register; 
a user program can poll this flag to determine whether an incoming message is 
available. The arrival of a message can also optionally signal an interrupt. Interrupt
handling is a privileged operation, but the operating system converts an
arrived-message interrupt into a signal to the user process. Every message bears 
a four-bit tag; under operating system control, some tags cause message-arrival 
interrupts and others do not. (The operating system reserves certain of the tag 
numbers for its own use; the hardware signals an invalid-operation interrupt to 
the operating system if a user program attempts to use a reserved message tag.) 

The Control Network and Data Network provide flow control autonomously. In 
addition, two mechanisms exist for notifying a sender that the network is temporarily
clogged. Failure of the network to accept a message sets a status flag in a
Network Interface control register; a user program can poll this flag to determine 
whether a retry is required. Failure to accept a message can also optionally signal 
an interrupt. 

Data can also be transferred from one user task to another, or to and from I/O 
devices. Both kinds of transfer are managed by the operating system using a 
common mechanism. An intertask data transfer is simply an I/O transfer through 
a named UNIX pipe. 


November 1993 

Copyright © 1993 Thinking Machines Corporation 






Connection Machine CM-5 Technical Summary 


16.2 Data Parallel Computations 

While the user may code arbitrary programs for the various processors and put 
the general capabilities of the Network Interface to any desired use, the CM-5 
architecture is designed to support especially well the data parallel model of 
programming. Parallel programs are often structured as alternating phases of
local computation and global communication. Local computation consists of 
operations by each processor on the data in its own memory. Global communication
includes any transfer of data between or among processors, possibly with
arithmetic or logical computation on the data as it is transferred. By managing 
data transfers globally and coherently rather than piecemeal, the data parallel 
model often realizes economies of scale, reducing the overhead of synchronization
for interprocessor communication. Frequently used patterns of
communication are captured in carefully tuned compiler code generators and 
run-time library routines; they are presented as primitive operators or intrinsic 
functions in high-level languages so that the programmer need not constantly
reinvent them. 

The following sections discuss various aspects of the data parallel programming 
model and sketch the ways in which each is supported by the CM-5 architecture 
and communications structure. 


Elemental and Conditional Computations 

Elemental computations, which involve operating on corresponding elements of 
arrays, are purely local computations if the arrays are divided in the same way 
among the processors. If two such matrices are to be added together, for example, 
every pair of numbers to be added reside together in the memory of a single 
processing node, and that node takes responsibility for performing the addition. 

Because each processing node executes its own instruction stream as well as 
processing its own local data, conditional operations are easily accommodated. 
For example, one processing node might contain an element on the boundary of 
an array while another might contain an interior element; certain filtering operations,
while allowing all elements to be processed at once, require differing
computations for boundary elements and interior elements. In the CM-5 data 
parallel architecture, some processors can take one branch of a conditional and 
others can take a different branch simultaneously with no loss of efficiency. 


Replication

Replication consists of making copies of data. The most important special case
is broadcasting, in which copies of a single item are sent to all processors. This 
is supported directly in hardware by the Control Network. 

Another common case is spreading, in which copies of elements of a lower-dimensional
array are used to fill out the additional dimensions of a higher-dimensional
array. For example, a column vector might be spread into a matrix, so that
each element of the vector is copied to every element of the corresponding row 
of the matrix. This case is handled by a combination of hardware mechanisms. 
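
The column-vector example can be written out concretely. This is a serial model of the result only; spread_column is a hypothetical name and a nested Python list stands in for the distributed matrix.

```python
def spread_column(vector, n_cols):
    """Copy element i of the vector into every position of row i."""
    return [[x] * n_cols for x in vector]


matrix = spread_column([1, 2, 3], 4)
for row in matrix:
    print(row)
# [1, 1, 1, 1]
# [2, 2, 2, 2]
# [3, 3, 3, 3]
```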

If the processors are partitioned into clusters of differing size, such that the network
addresses within each cluster are contiguous, then one or two parallel-prefix
operations by the Control Network can copy a value from one processor
in each cluster to all others in that cluster with particular speed.
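
The text does not say how the parallel-prefix operations are arranged, so the following is one plausible construction rather than a description of the actual implementation. It assumes that the source processor in each cluster contributes its value while every other processor contributes zero (the identity for OR); a forward segmented OR-scan then carries the value to higher addresses, and a reverse segmented OR-scan carries it to lower addresses. All function names are hypothetical.

```python
def seg_prefix_or(values, seg_start):
    """Forward OR-scan that restarts at the first processor of each cluster."""
    out, acc = [], 0
    for v, start in zip(values, seg_start):
        acc = v if start else acc | v
        out.append(acc)
    return out


def seg_suffix_or(values, seg_start):
    """Reverse OR-scan that restarts at the last processor of each cluster."""
    n = len(values)
    out, acc = [0] * n, 0
    for i in range(n - 1, -1, -1):
        is_last = (i == n - 1) or bool(seg_start[i + 1])
        acc = values[i] if is_last else acc | values[i]
        out[i] = acc
    return out


def cluster_broadcast(values, is_source, seg_start):
    """Copy each cluster's source value to every processor in that cluster."""
    contrib = [v if src else 0 for v, src in zip(values, is_source)]
    forward = seg_prefix_or(contrib, seg_start)
    backward = seg_suffix_or(contrib, seg_start)
    return [f | b for f, b in zip(forward, backward)]


# Two clusters of differing size (3 and 5 processors) at contiguous addresses.
seg_start = [1, 0, 0, 1, 0, 0, 0, 0]
is_source = [0, 1, 0, 0, 0, 0, 0, 1]
values    = [4, 9, 2, 8, 6, 1, 3, 5]
print(cluster_broadcast(values, is_source, seg_start))   # [9, 9, 9, 5, 5, 5, 5, 5]
```

If the source is always the lowest-addressed processor of its cluster, the forward scan alone suffices, which would account for the "one or two" in the text.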


Reduction 

Reduction consists of combining many data elements to produce a smaller number
of results. The most important special case is global reduction, in which
every processor contributes a value and a single result is produced. The operations
of integer summation, finding the integer maximum, logical OR, and logical
exclusive OR are supported directly in hardware by the Control Network. Floating-point
reduction operations are carried out by the nodes with the help of the
Control Network and Data Network. 

A common operation sequence is a global reduction immediately followed by a 
broadcast of the resulting value. The Control Network supports this combination 
as a single step, carrying it out in no more time than a simple reduction. 

The cases of reduction along the axes of a multidimensional array correspond to 
the cases of spreading into a multidimensional array and have similar solutions. 
The rows of a matrix might be summed, for example, to form a column matrix. 
This case is handled by a combination of hardware mechanisms.

If the processors are partitioned into clusters of differing size, such that the 
network addresses within each cluster are contiguous, then one or two parallel- 
prefix operations by the Control Network can reduce values from all processors 
within each cluster and optionally redistribute the result for that cluster to all 
processors in that cluster. 
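
As an illustration of reduction within clusters, the following Python sketch models contiguous clusters as consecutive slices of a list. The name segmented_reduce, the cluster_sizes argument, and the redistribute flag are hypothetical and chosen only for this example; the sketch says nothing about how the Control Network implements the operation.

```python
def segmented_reduce(values, cluster_sizes, combine, redistribute=True):
    """Reduce within contiguous clusters of differing (nonzero) size.

    With redistribute=True every processor in a cluster receives the
    cluster's result; otherwise one result per cluster is returned.
    """
    assert sum(cluster_sizes) == len(values)
    out = []
    start = 0
    for size in cluster_sizes:
        cluster = values[start:start + size]
        total = cluster[0]
        for v in cluster[1:]:
            total = combine(total, v)
        out.extend([total] * size if redistribute else [total])
        start += size
    return out


values = [1, 2, 3, 4, 5, 6]
add = lambda a, b: a + b
per_cluster = segmented_reduce(values, [2, 1, 3], add, redistribute=False)
spread_back = segmented_reduce(values, [2, 1, 3], add)
```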

Permutation

The Data Network is specifically designed to handle all cases of permutation,
where each input value contributes to one result and each result is simply a copy
of one input value. The Data Network has a single, uniform hardware interface
and a structure designed to provide especially good performance when the
pattern of exchange exhibits reasonable locality. Both nearest-neighbor and
nearest-but-one-neighbor communication within a grid are examples of patterns
with good locality. These particular patterns also exhibit regularity, but regularity
is not a requirement for good Data Network performance. The irregular polygonal
tessellations of a surface or a volume that are typical of finite-element
methods lead to communications patterns that are irregular but local. The Data
Network performs as well for such patterns as for regular grids.
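
A permutation in this sense can be modeled in a few lines. The Python sketch below is illustrative only; permute and the destination list are names invented for the example, and the ring shift stands in for a regular nearest-neighbor pattern.

```python
def permute(inputs, destination):
    """Send inputs[i] to position destination[i].

    `destination` must be a permutation of 0..n-1, so each input
    contributes to exactly one result and each result copies one input.
    """
    n = len(inputs)
    assert sorted(destination) == list(range(n))
    results = [None] * n
    for i, d in enumerate(destination):
        results[d] = inputs[i]
    return results


# Nearest-neighbor shift on a ring of 5 processors: a regular, local pattern.
ring_shift = [(i + 1) % 5 for i in range(5)]
shifted = permute(["a", "b", "c", "d", "e"], ring_shift)
```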


Parallel Prefix 

Parallel prefix operations embody a very specific, complex yet regular,
combination of replication and reduction operations. A parallel prefix operation produces
as many results as there are inputs, but each input contributes to many results and 
each result is produced by combining multiple inputs. Specifically, the inputs and 
results are linearly ordered; suppose there are n of them. Then result j is the 
reduction of the first j inputs; it follows that input j contributes to the last n-j+1 
results. (For a reverse parallel prefix operation — also called a parallel suffix 
operation — these are reversed: result j is the reduction of the last n-j+1 inputs, 
and input j contributes to the first j results.) 
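
The definitions above can be checked with a short sequential model. This is a Python sketch for illustration; parallel_prefix and parallel_suffix are names chosen here, and the combining operation is assumed to be commutative (as integer addition is).

```python
from itertools import accumulate
import operator


def parallel_prefix(inputs, combine):
    """Result j (counting from 1) is the reduction of the first j inputs."""
    return list(accumulate(inputs, combine))


def parallel_suffix(inputs, combine):
    """Result j (counting from 1) is the reduction of the last n-j+1 inputs."""
    return list(accumulate(reversed(inputs), combine))[::-1]


inputs = [1, 2, 3, 4]
prefix = parallel_prefix(inputs, operator.add)
suffix = parallel_suffix(inputs, operator.add)
```

With n = 4, input 2 contributes to prefix results 2, 3, and 4, that is, to the last n-j+1 = 3 results.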

The Control Network handles parallel prefix (and parallel suffix) operations 
directly, in the same manner and at the same speed as reduction operations, for 
integer and logical combining operations. The input values and the results are 
linearly ordered by network address. 

The Control Network also directly supports segmented parallel prefix operations. 
If the processors are partitioned into clusters of differing size, such that the
network addresses within each cluster are contiguous, then a single Control
Network operation can compute a separate parallel prefix or suffix within each 
cluster. 
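
A segmented parallel prefix restarts the running reduction at each cluster boundary. The Python sketch below models this sequentially; the name segmented_prefix and the cluster_sizes argument are assumptions of the example, not part of any CM-5 interface.

```python
def segmented_prefix(values, cluster_sizes, combine):
    """Compute a separate prefix within each contiguous cluster."""
    assert sum(cluster_sizes) == len(values)
    out = []
    start = 0
    for size in cluster_sizes:
        running = None
        for v in values[start:start + size]:
            # The running value restarts at the first element of each cluster.
            running = v if running is None else combine(running, v)
            out.append(running)
        start += size
    return out


segmented = segmented_prefix([1, 2, 3, 4, 5, 6], [2, 1, 3], lambda a, b: a + b)
```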

More complex cases of parallel prefix operations, such as on the rows or columns 
of a matrix or on linked lists, are variously handled through the Control Network 
or Data Network in cooperation with the nodes. 

Virtual Processors

Data parallel programming provides the high-level programmer with the illusion
of as many processors as necessary; one programs as if there were a processor
for every data element to be processed. These are often described as virtual
processors, by analogy with conventional virtual memory, which provides the
illusion of having more main memory than is physically present.

The CM-5 architecture, rather than implementing virtual processors entirely in 
firmware, relies primarily on software technology to support virtual processors. 
CM-5 compilers for high-level data parallel languages generate control-loop code 
and run-time library calls to be executed by the processing nodes. This provides 
the same virtual-processor functionality made available by the Paris instruction 
set on the Connection Machine models CM-2 and CM-200, but adds further 
opportunities for compile-time optimization. 
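
The idea of a control loop standing in for virtual processors can be sketched as follows. This Python model is only an illustration: the block distribution of elements over nodes and the names run_on_node and data_parallel_apply are assumptions of the example, not a description of what the CM-5 compilers actually emit.

```python
def run_on_node(local_elements, vp_operation):
    """Control loop on one physical node: apply the operation once per
    virtual processor (one per data element held by the node)."""
    return [vp_operation(x) for x in local_elements]


def data_parallel_apply(data, num_nodes, vp_operation):
    """Distribute the data over the nodes in blocks (an assumption of this
    sketch) and let each node loop over its own virtual processors."""
    per_node = -(-len(data) // num_nodes)   # ceiling division
    results = []
    for node in range(num_nodes):
        local = data[node * per_node:(node + 1) * per_node]
        results.extend(run_on_node(local, vp_operation))
    return results


# Ten data elements (ten virtual processors) on four physical nodes.
squares = data_parallel_apply(list(range(10)), 4, lambda x: x * x)
```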


16.3 Low-Level User Programming

Low-level programs may be written for the CM-5 in C or Fortran 77. Assembly 
language is also available, though C should be adequate for most low-level 
purposes; all hardware facilities are directly accessible to the C programmer. A 
special assembler allows hand-coding of individual vector instructions for the 
processing nodes. Such instructions may be assembled separately or inserted 
directly into C code. 

One writes low-level programs as two pieces of code: one piece is executed in 
the control processor, and the other is replicated at program start-up and executed 
by each processing node. One speaks of writing a program in “C & C” (a C
program for the control processor and a C program for the nodes); one may also
write in “Fortran & Fortran” (“F & F”), in “C & assembler,” etc. 

A package of macros and run-time functions supports common communications 
operations within a message-passing framework (see Chapter 14). Such low- 
level communications access allows the user to experiment with program 
organizations other than data parallel, to port programs easily from MIMD
architectures, and to implement new primitives for use in high-level programs. 

17.1 Control Processor Architecture

A control processor (CP) is essentially like a standard high-performance
workstation computer. It consists of a standard RISC microprocessor, associated
memory and memory interface, and perhaps I/O devices such as local disks and 
Ethernet connections. It also includes a CM-5 Network Interface, providing 
access to the Control Network and Data Network. 

A control processor acting as a partition manager (PM) controls each partition 
and communicates with the rest of the CM-5 system through the Control Network 
and Data Network. For example, a PM initiates I/O by sending a request through 
the Data Network to a second CP, an I/O control processor. A PM initiates
task-switching by using the Control Network to send a broadcast interrupt to all
processing nodes; privileged operating-system support code in each node then
carries out the bulk of the work. To access the Control Network and Data
Network, each CP uses its Network Interface, a memory-mapped device in the
memory address space of its microprocessor. 

The microprocessor supports the customary distinction between user and
supervisor code. User code can run in the control processor at the same time that
user code for the same job is running in the processing nodes. Protection of the
supervisor, and of one user from another, is supported by the same mechanisms
used in workstations and single-processor timeshared computers, namely memory
address mapping and protection and the suppression of privileged operations in
user mode. In particular, the operating system prevents a user process from
performing privileged Network Interface operations; the privileged control
registers simply are not mapped into the user address space.

Current implementations of the CM-5 control processor use either SPARC or
SuperSPARC microprocessors. It is expected that, over time, the implementation of
the CP will track the RISC microprocessor technology curve to provide the best
possible functionality and performance at any given point in time; therefore it
is recommended that low-level programming be carried out in C as much as
possible, rather than in assembly language.


17.2 Processing Node Architecture 

The CM-5 processing node is designed to deliver very good cost-performance
when used in large numbers for data parallel applications. Like the control
processor, the node makes use of industry-standard RISC microprocessor
technology. This microprocessor may optionally be augmented with a special
high-performance hardware arithmetic accelerator that uses wide data paths,
deep pipelines, and large register files to improve peak computational
performance.

The node design is centered around a standard 64-bit bus. To this node bus are 
attached a RISC microprocessor, a CM-5 Network Interface, and memory. Note 
that all logical connections to the rest of the system pass through the Network 
Interface. 

The node memory consists of standard DRAM chips and an 8-Kbyte boot ROM; 
the microprocessor also has a 64-Kbyte cache that holds both instructions and 
data. All DRAM memory is protected by ECC checking, which corrects single-bit
errors and detects two-bit errors and DRAM chip failures. The boot ROM
contains code to be executed following a system reset, including local processor and
memory verification and the communications code needed to download further 
diagnostics or operating system code. 

The memory configuration depends on whether the optional high-performance 
arithmetic hardware is included. Without the arithmetic hardware, the memory 
is connected by a 72-bit path (64 data bits plus 8 ECC bits) to a memory
controller that in turn is attached to the node bus. (See Figure 29.) In this configuration
the memory size can be 8, 16, or 32 Mbytes. (This assumes 4-Mbit DRAM 
technology. Future improvements in DRAM technology will permit increases in 
memory size. The CM-5 architecture and chip implementations anticipate these 
future improvements.) 

Figure 29. Processing node.

The basic CM-5 processing node consists of a RISC microprocessor, memory subsystem, and
a CM-5 Network Interface all connected to a standard 64-bit bus. The RISC microprocessor
is responsible for instruction fetch, instruction execution, processing data, and controlling
the Network Interface. The memory subsystem consists of a memory controller and either
8 Mbytes, 16 Mbytes, or 32 Mbytes of DRAM memory. The path from each memory back
to the memory controller is 72 bits wide, consisting of 64 data bits and 8 bits of ECC code.
The ECC circuits in the memory controller can correct single-bit errors and detect
double-bit errors as well as failure of any single DRAM chip. The Network Interface connects the
node to the rest of the system through the Control Network and Data Network.

If the high-performance arithmetic hardware is included, then the node memory
is divided into four independent banks, each with a 72-bit (64 data bits plus 
8 ECC bits) access path. (See Figure 30.) 

Figure 30. Processing node with vector units.

A CM-5 processing node may optionally contain an arithmetic accelerator. In this
configuration the node has 32 or 128 Mbytes of memory: four banks of 8 or 32 Mbytes each.
The memory controller is replaced by four vector units. Each vector unit has a dedicated
72-bit path to its associated memory bank, providing peak memory bandwidth of 160
Mbytes/sec per vector unit, and performs all the functions of a memory controller,
including generation and checking of ECC bits. Each vector unit has 40 Mflops peak 64-bit
floating-point performance and 40 Mops peak 64-bit integer performance. The vector units
execute vector instructions issued to them by the RISC microprocessor. Each vector
instruction may be issued to a specific vector unit (or pair of units), or broadcast to all four
vector units at once. The microprocessor takes care of such “housekeeping” computations
as address calculation and loop control, overlapping them with vector instruction
execution. Together, the vector units provide 640 Mbytes/sec memory bandwidth and 160 Mflops
peak 64-bit floating-point performance. A single CM-5 node with vector units is a
supercomputer in itself.

The special arithmetic hardware consists of four vector units (VUs), one for each
memory bank, connected separately to the node bus. In this configuration the
memory size is either 8 or 32 Mbytes per VU for a total of 32 or 128 Mbytes per
node. (This figure assumes either 4-Mbit or 16-Mbit DRAM technology and will
increase as industry-standard memories are improved.) Each VU also
implements all memory controller functions, including ECC checking, so that the
entire memory appears to be in the address space of the microprocessor exactly
as if the arithmetic hardware were not present.
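
The per-node totals follow directly from having four vector units per node. As a quick arithmetic check (the per-unit peak figures are those quoted in the Figure 30 caption; the Python names are just labels for this sketch):

```python
VUS_PER_NODE = 4
PEAK_MFLOPS_PER_VU = 40            # 64-bit floating point, peak
PEAK_MBYTES_PER_SEC_PER_VU = 160   # memory bandwidth, peak

# 8 or 32 Mbytes per vector unit gives 32 or 128 Mbytes per node.
node_memory_mbytes = [VUS_PER_NODE * bank for bank in (8, 32)]
node_peak_mflops = VUS_PER_NODE * PEAK_MFLOPS_PER_VU
node_peak_mbytes_per_sec = VUS_PER_NODE * PEAK_MBYTES_PER_SEC_PER_VU
```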

The memory controller or vector unit also provides a word-based interface to the 
system Diagnostics Network (see Section 19.8). This provides an extra
communications path to the node; it is designed to be slow but reliable and is used
primarily for hardware fault diagnosis. 

As with the control processors, the implementation of the CM-5 processing node 
is expected to track the RISC microprocessor technology curve to provide the 
best possible functionality and performance at any given point in time; therefore 
it is recommended that low-level programming be carried out in C as much as 
possible, rather than in assembly language. Current implementations of the CM-5 
processing node use SPARC or SuperSPARC microprocessors.


17.3 Vector Unit Architecture 

Each vector unit (VU) is a memory controller and computational engine
controlled by a memory-mapped control-register interface. (See Figure 31.) When
a read or write operation on the node bus addresses a VU, the memory address 
is further decoded. High-order bits indicate the operation type: 

■ For an ordinary memory transaction, the low-order address bits indicate 
a location in the memory bank associated with the VU, which acts as a 
memory controller and performs the requested memory read or write 
operation. 

■ For a control register access, the low-order address bits indicate a control 
register to be read or written. 

■ For a data register access, the low-order address bits indicate a data
register to be read or written.

■ For a vector-unit instruction, the node memory bus operation must be a
write (an attempt to read from this part of the address space results in
a bus error). The data on the memory bus is not written to memory but is 
interpreted as an instruction to be executed by the vector execution portion 
of the VU. The low-order address bits indicate a location in the memory 
bank associated with the VU; the instruction uses this address if it includes 
operations on memory. A vector-unit instruction may be addressed to any 
single VU (in which case the other three VUs ignore it), to a pair of VUs, 
or to all four VUs simultaneously. 



Figure 31. Vector unit functional architecture. 

The first two types of operation are identical to those performed by the memory
controller when vector units are absent. The third type permits the
microprocessor to read or write the register file of any vector unit. The fourth type
of operation initiates high-performance arithmetic computation. This
computation has both vector and parallel characteristics: each VU can perform vector
operations, and a single instruction may be issued simultaneously to all four.
If the vector length is 16, then issuing a single instruction can result in as many
as 64 individual arithmetic operations (16 for each of the four VUs), or even
128 operations if the instruction specifies a compound operation such as
multiply-add.
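
The operation counts quoted above are simple products. A minimal Python sketch, in which operations_per_issue and its parameter names are hypothetical and a compound operation such as multiply-add is counted as two arithmetic operations per element:

```python
def operations_per_issue(vector_length, units_addressed=4, compound=False):
    """Arithmetic operations triggered by one issued vector instruction."""
    ops = vector_length * units_addressed
    return 2 * ops if compound else ops


all_four = operations_per_issue(16)                        # simple operation
all_four_madd = operations_per_issue(16, compound=True)    # multiply-add
single_vu = operations_per_issue(16, units_addressed=1)
```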

Vector units cannot fetch their own instructions; they merely react to instructions 
issued to them by the microprocessor. The instruction format, instruction set, and 
maximum vector length have been chosen so that the microprocessor can keep 
the vector units busy while having time of its own to fetch instructions (both its 
own and those for the vector units), calculate addresses, execute loop and branch 
instructions, and carry out other algorithmic bookkeeping. 

Each vector unit has 64 64-bit registers, which can also be addressed as 128 
32-bit registers. Other control registers worth noting are the 16-bit vector mask 
(VM) and the 4-bit vector length (VL) registers. The vector mask register controls 
certain conditional operations and optionally receives single-bit status results for 
each vector element processed. The vector length register specifies the number 
of elements to be processed by each vector instruction. 

The vector unit actually processes both vector and scalar instructions; a
scalar-mode instruction is handled as if it were a vector-mode instruction of length 1.
Thus scalar-mode instructions always operate on single registers; vector-mode 
instructions operate on sequences of registers. Each register operand is specified 
by a 7-bit starting register number and a 7-bit stride. The first element for that 
vector operand is taken from the starting register; thereafter the register number 
is incremented by the stride to produce a new register number indicating the next 
element to be processed. Using a large stride has the same effect as using a 
negative stride, so it is possible to process a vector in reverse order. Most
instruction formats use a default stride of 1 for 32-bit operands or 2 for 64-bit operands,
so as to process successive registers, but one instruction format allows arbitrary
strides to be specified for all operands, and another allows one vector operand
to take its elements from an arbitrary pattern of registers by means of a
mechanism for indirect addressing of the register file.
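
The remark that a large stride acts like a negative stride can be illustrated with modular arithmetic. The Python sketch below assumes that register numbers wrap modulo 128, which is what 7-bit register numbers suggest but is an assumption of this example rather than something stated here; operand_registers is a hypothetical helper name.

```python
REGISTER_FILE_SIZE = 128   # 7-bit register numbers (32-bit registers)


def operand_registers(start, stride, vector_length):
    """Register numbers visited by one vector operand, assuming the
    register number wraps around modulo 128 after each increment."""
    return [(start + i * stride) % REGISTER_FILE_SIZE
            for i in range(vector_length)]


forward = operand_registers(start=8, stride=1, vector_length=4)
# A stride of 127 is congruent to -1 modulo 128, so it walks backward.
backward = operand_registers(start=11, stride=127, vector_length=4)
```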

Each vector unit includes an adder, a multiplier, memory load/store, indirect
register addressing, indirect memory addressing, and population count. Every
vector-unit instruction can specify at least one arithmetic operation and an
independent memory operation. Every instruction also has four register-address
fields: three for the arithmetic operation and one for the memory operation. All
binary arithmetic operations are fully three-address; an addition, for example,
can read two source registers and write into a third destination register. The
memory operation can address a completely independent register. If, however, a
load operation addresses a register that is also a source for the arithmetic
operation, then load-chaining occurs, so that the loaded memory data is used as an
arithmetic operand in the same instruction. Indirect memory addressing supports 
scatter/gather operations and vectorized pointer indirection. 

Two mechanisms provide for conditional processing of vector elements within 
each processing node. Each vector unit contains a vector mask register; vector 
elements are not processed in positions where the corresponding vector mask bit 
is zero. Alternatively, a vector-mask enumeration mechanism may be used in 
conjunction with the scatter/gather facility to pack vector elements that require 
similar processing; after bulk application of unconditional vector operations, the 
results are then unpacked and scattered to their originally intended destinations. 
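
The two conditional mechanisms can be contrasted in a small sequential model. This Python sketch is illustrative only; masked_apply and pack_apply_unpack are names invented for the example, and lists stand in for vector registers and memory.

```python
def masked_apply(values, mask, op):
    """Vector-mask style: positions whose mask bit is zero are not processed."""
    return [op(v) if m else v for v, m in zip(values, mask)]


def pack_apply_unpack(values, mask, op):
    """Enumeration style: pack the selected elements, process them in bulk
    without conditions, then scatter results back to their original places."""
    positions = [i for i, m in enumerate(mask) if m]   # vector-mask enumeration
    packed = [values[i] for i in positions]            # gather
    processed = [op(v) for v in packed]                # unconditional bulk operation
    result = list(values)
    for i, v in zip(positions, processed):             # scatter
        result[i] = v
    return result


values = [1, 2, 3, 4]
mask = [1, 0, 1, 0]
by_mask = masked_apply(values, mask, lambda v: v * 10)
by_packing = pack_apply_unpack(values, mask, lambda v: v * 10)
```

Both routes give the same result; packing pays off when only a small fraction of the elements needs a given treatment.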

Vector-unit instructions come in five formats. (See Figure 32.) The 32-bit short 
format allows many common scalar and vector operations to be expressed
succinctly. The four 64-bit long formats extend the basic 32-bit format to allow
additional information to be specified: a 32-bit immediate operand, a signed 
memory stride, a set of register strides, or additional control fields (some of 
which can update certain control registers with no additional overhead). 

The short format includes an arithmetic opcode (8 bits), a load/store opcode (3
bits), a vector/scalar mode specifier (2 bits), and four register fields called rLS,
rD, rS1, and rS2 that designate the starting registers for the load/store operation
and for the arithmetic destination, first source, and second source, respectively.
The vector/scalar specifier indicates whether the instruction is to be executed
once (scalar mode) or many times (vector mode). It also governs the expansion
of the 4-bit register specifiers into full 7-bit register addresses. The short format
is designed to support a conventional division of the uniform register file into
vector registers of length 16, 8, or (for 64-bit operands only) 4, with scalar
quantities kept in the first 16 registers. For a scalar-mode instruction, the 4-bit register
field provides the low-order bits of the register number (which is then multiplied
by 2 for 64-bit operands); for a vector-mode instruction, it provides the
high-order bits of the register number. The rS1 field is 7 bits wide; in some cases these
specify a full 7-bit register number for arithmetic source 1 and in other cases 4
bits specify a vector register and the other 3 bits convey stride information.

Figure 32. Vector unit instruction formats.

Each instruction issued by the RISC microprocessor to the vector units is 32 bits or 64 bits wide. The
32-bit format is designed to cover the operations and register access patterns most likely to arise in
high-performance compiled code. The 32 high-order bits of the 64-bit format are identical to the 32-bit
format. The 32 low-order bits provide an immediate operand, a signed memory stride, or
specifications for more complex or less frequent operations.


A short scalar-mode instruction can therefore access the first 16 32-bit or 64-bit 
elements of the register file, simultaneously performing an arithmetic operation 
and loading or storing a register. (The memory address that accompanies the 
issued instruction indicates the memory location to be accessed.) One of the 
arithmetic operands (S1) may be in any of the 128 registers in the register file.

A short vector-mode instruction can conveniently treat the register file as a set 
of vector registers: 

Many options are available for vector-mode instructions. These include a choice
between a default memory stride and the last explicitly specified memory stride,
as well as a choice of register stride for the S1 operand (last specified, 1, or
0; stride 0 treats the S1 operand as a scalar to be combined with every element
of a vector).

The long instruction formats are all compatible extensions of the short format:
the most significant 32 bits of a 64-bit instruction are decoded as a 32-bit
instruction, and the least significant 32 bits specify additional operations or operands.
If the rS2 field of a long instruction is zero, then the low-order 32 bits of the
instruction constitute an immediate scalar value to be used as the S2 operand. If
the arithmetic operation requires a 64-bit operand, then the immediate value is
zero-extended left if an unsigned integer is required, sign-extended left for a
signed integer, or zero-extended right for a floating-point number. 
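
The three ways of widening a 32-bit immediate to 64 bits can be written out as bit manipulations. The Python sketch below is illustrative; the function names are chosen for this example, and "zero-extended right" is read as placing the immediate in the high-order 32 bits with zeros below it. The helper as_double merely reinterprets a 64-bit pattern as an IEEE double to show the effect for a floating-point immediate.

```python
import struct

MASK32 = 0xFFFFFFFF


def zero_extend_left(imm32):
    """Unsigned integer: the upper 32 bits become zero."""
    return imm32 & MASK32


def sign_extend_left(imm32):
    """Signed integer: the upper 32 bits copy the immediate's sign bit."""
    imm32 &= MASK32
    return (imm32 | 0xFFFFFFFF00000000) if (imm32 & 0x80000000) else imm32


def zero_extend_right(imm32):
    """Floating point: the immediate becomes the high-order 32 bits."""
    return (imm32 & MASK32) << 32


def as_double(bits64):
    """Reinterpret a 64-bit pattern as an IEEE double (illustration only)."""
    return struct.unpack(">d", bits64.to_bytes(8, "big"))[0]


unsigned_value = zero_extend_left(0xFFFFFFFE)
signed_value = sign_extend_left(0xFFFFFFFE)     # two's-complement -2
float_bits = zero_extend_right(0x3FF80000)      # high half of the double 1.5
```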

If the rS2 field of a long instruction is not zero, then the two high-order bits of 
the low 32 are decoded. If the two bits match, then the low-order 32 bits are an 
explicit signed memory stride. (Note that it is possible to specify such a stride 
even in a scalar-mode long instruction, in order to latch the stride in preparation 
for a following vector-mode instruction that might need to use another of the 
long formats.) Code 10 indicates additional register number and register stride 
information, allowing specification of complete 7-bit register numbers and
register strides for the rLS, rD, and rS2 operands. This enables complex regular
patterns of register access. Code 01 indicates a variety of control fields for such
mechanisms as changing the vector length, controlling use of the vector mask,
indirect addressing, S1 operand register striding, and population count.
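
The decoding rules for the low-order word of a long instruction can be summarized as a small classifier. This Python sketch follows only the rules stated above; the function name, the returned labels, and the choice to leave the register-stride and control-field words undecoded are assumptions of the example, since the individual field layouts are not given here.

```python
def classify_long_low_word(rs2_field, low_word):
    """Classify the low-order 32 bits of a 64-bit vector-unit instruction."""
    low_word &= 0xFFFFFFFF
    if rs2_field == 0:
        return ("immediate", low_word)
    top_two = low_word >> 30
    if top_two in (0b00, 0b11):
        # Matching bits: the whole word is a signed (two's-complement) stride.
        stride = low_word - (1 << 32) if top_two == 0b11 else low_word
        return ("memory_stride", stride)
    if top_two == 0b10:
        return ("register_numbers_and_strides", low_word)
    return ("control_fields", low_word)   # code 01


kinds = [
    classify_long_low_word(0, 7),
    classify_long_low_word(3, 0x00000010),
    classify_long_low_word(3, 0xFFFFFFF8),
    classify_long_low_word(3, 0x80000000),
    classify_long_low_word(3, 0x40000000),
]
```

Because the two high-order bits must match for a stride, the usable stride range is that of a 31-bit signed quantity.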

The arithmetic operations that can be specified by the ALU-F instruction field are
summarized in Table 1. Note the large set of three-operand multiply-add
instructions. These come in three different addressing patterns: accumulative, which
adds a product into a destination register (useful for dot products); inverted,
which multiplies the destination by one source and then adds in the other (useful
for polynomial evaluation and integer subscript computations); and full triadic,
which takes one operand from the load/store register so that the destination
register may be distinct from all three sources. The triadic multiply-add operations are
provided for signed and unsigned integers as well as for floating-point operands, 
in both 32-bit and 64-bit sizes. Unsigned 64-bit multiply-boolean operations are 
also provided. (Note that multiplying by a power of two has the effect of a shift.) 
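
The three multiply-add addressing patterns correspond to familiar scalar recurrences. The Python sketch below is a sequential illustration only; the function names are invented for the example, and each comment gives the register form of the matching pattern.

```python
def dot_product(xs, ys):
    """Accumulative pattern: rD = (rS1 * rS2) + rD."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = (x * y) + acc
    return acc


def horner(coefficients, x):
    """Inverted pattern: rD = (rS2 * rD) + rS1.
    Coefficients are given highest degree first."""
    acc = 0
    for c in coefficients:
        acc = (x * acc) + c
    return acc


def triadic(s1, ls, s2):
    """Full triadic pattern: rD = (rS1 * rLS) + rS2."""
    return (s1 * ls) + s2


dot = dot_product([1, 2, 3], [4, 5, 6])
poly = horner([2, 0, 1], 3)          # 2*x**2 + 1 evaluated at x = 3
tri = triadic(2, 3, 4)
shifted = 5 * 2 ** 3                 # multiplying by a power of two acts as a shift
```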

Table 1. Summary of vector unit arithmetic instructions (Part I).

imove    dimove   umove    dumove   fmove    dfmove   Move: D = S1 + 0
itest    ditest   utest    dutest   ftest    dftest   Move and generate status
icmp     dicmp    ucmp     ducmp    fcmp     dfcmp    Compare
iadd     diadd    uadd     duadd    fadd     dfadd    Add
isub     disub    usub     dusub    fsub     dfsub    Subtract
isubr    disubr   usubr    dusubr   fsubr    dfsubr   Subtract reversed
imul     dimul    umul     dumul    fmul     dfmul    Multiply (low 64 bits for integers)
         dimulh            dumulh                     Integer multiply (high 64 bits)
                                    fdiv     dfdiv    Divide
                                    finv     dfinv    Invert: D = 1.0/S1
                                    fsqrt    dfsqrt   Square root
                                    fisqt    dfisqt   Inverse square root: D = 1.0/SQRT(S2)
ineg     dineg                      fneg     dfneg    Negate
iabs     diabs                      fabs     dfabs    Absolute value
iaddc    diaddc   uaddc    duaddc                     Integer add with carry
isubc    disubc   usubc    dusubc                     Integer subtract with borrow
isbrc    disbrc   usbrc    dusbrc                     Integer subtract reversed with borrow
                  ushl     dushl                      Integer shift left
                  ushlr    dushlr                     Integer shift left reversed
                  ushr     dushr                      Integer shift right logical
                  ushrr    dushrr                     Integer shift right logical reversed
ishr     dishr                                        Integer shift right arithmetic
ishrr    dishrr                                       Integer shift right arithmetic reversed
                  uand     duand                      Bitwise logical AND
                  uandc    duandc                     Bitwise logical AND with complement
                  unand    dunand                     Bitwise logical NAND
                  uor      duor                       Bitwise logical OR
                  unor     dunor                      Bitwise logical NOR
                  uxor     duxor                      Bitwise logical XOR
                  unot     dunot                      Bitwise logical NOT
                  umrg     dumrg                      Merge: D = (if mask then S2 else S1)
                  uffb     duffb                      Find first 1-bit

Table 1. Summary of vector unit arithmetic instructions (Part II).

imada    dimada   umada    dumada   fmada    dfmada   rD = (rS1 * rS2) + rD
imsba    dimsba   umsba    dumsba   fmsba    dfmsba   rD = (rS1 * rS2) - rD
imsra    dimsra   umsra    dumsra   fmsra    dfmsra   rD = - (rS1 * rS2) + rD
inmaa    dinmaa   unmaa    dunmaa   fnmaa    dfnmaa   rD = - (rS1 * rS2) - rD
imadi    dimadi   umadi    dumadi   fmadi    dfmadi   rD = (rS2 * rD) + rS1
imsbi    dimsbi   umsbi    dumsbi   fmsbi    dfmsbi   rD = (rS2 * rD) - rS1
imsri    dimsri   umsri    dumsri   fmsri    dfmsri   rD = - (rS2 * rD) + rS1
inmai    dinmai   unmai    dunmai   fnmai    dfnmai   rD = - (rS2 * rD) - rS1
imadt    dimadt   umadt    dumadt   fmadt    dfmadt   rD = (rS1 * rLS) + rS2
imsbt    dimsbt   umsbt    dumsbt   fmsbt    dfmsbt   rD = (rS1 * rLS) - rS2
imsrt    dimsrt   umsrt    dumsrt   fmsrt    dfmsrt   rD = - (rS1 * rLS) + rS2
inmat    dinmat   unmat    dunmat   fnmat    dfnmat   rD = - (rS1 * rLS) - rS2
                           dumsa                      rD = lower(rS1 * rS2) AND rD
                           dumhsa                     rD = upper(rS1 * rS2) AND rD
                           dumma                      rD = lower(rS1 * rS2) AND NOT rD
                           dumhma                     rD = upper(rS1 * rS2) AND NOT rD
                           dumoa                      rD = lower(rS1 * rS2) OR rD
                           dumhoa                     rD = upper(rS1 * rS2) OR rD
                           dumxa                      rD = lower(rS1 * rS2) XOR rD
                           dumhxa                     rD = upper(rS1 * rS2) XOR rD
                           dumsi                      rD = lower(rS2 * rD) AND rS1
                           dumhsi                     rD = upper(rS2 * rD) AND rS1
                           dummi                      rD = lower(rS2 * rD) AND NOT rS1
                           dumhmi                     rD = upper(rS2 * rD) AND NOT rS1
                           dumoi                      rD = lower(rS2 * rD) OR rS1
                           dumhoi                     rD = upper(rS2 * rD) OR rS1
                           dumxi                      rD = lower(rS2 * rD) XOR rS1
                           dumhxi                     rD = upper(rS2 * rD) XOR rS1
                           dumst                      rD = lower(rS1 * rLS) AND rS2
                           dumhst                     rD = upper(rS1 * rLS) AND rS2
                           dummt                      rD = lower(rS1 * rLS) AND NOT rS2
                           dumhmt                     rD = upper(rS1 * rLS) AND NOT rS2
                           dumot                      rD = lower(rS1 * rLS) OR rS2
                           dumhot                     rD = upper(rS1 * rLS) OR rS2
                           dumxt                      rD = lower(rS1 * rLS) XOR rS2
                           dumhxt                     rD = upper(rS1 * rLS) XOR rS2

Table 1. Summary of vector unit arithmetic instructions (Part III).

fclas    dfclas   Classify operand
fexp     dfexp    Extract exponent
fmant    dfmant   Extract mantissa with hidden bit
uenc     duenc    Make float from exponent (S1) and mantissa (S2)
fnop              No arithmetic operation
cvtfi             Convert integer to float*
cvtf              Convert float to float*
cvtir             Convert float to integer (round)*
cvti              Convert float to integer (truncate)*
trap              Generate debug trap
etrap             Generate trap on enabled exception
ldvm              Load vector mask
stvm              Store vector mask

* The rS2 field encodes the source and result sizes and formats for these instructions.


The LS-F instruction field specifies one of 5 load/store operations: 

■ no operation

■ 32-bit load

■ 64-bit load

■ 32-bit store

■ 64-bit store

The load/store size (32 or 64 bits) need not be the same as the arithmetic operand 
size. They should be the same, however, if load chaining is used. There is no 
distinction between integer and floating-point loads and stores. A 64-bit load or 
store may be used to load or store an even-odd 32-bit register pair. 


Executing Vector Code 

All instruction fetching and control decisions for the vector units are made by the 
node microprocessor. When vector units are present, all instructions and data 
reside in the memory banks associated with the vector units. A portion of each 
memory bank is conventionally reserved for instruction and data areas for the
microprocessor. The memory management hardware of the microprocessor is
used to map pages from the four memory banks so as to make them appear
contiguous to the microprocessor.

While the microprocessor does not have its own memory, it does have a local
cache that is used for both instruction and data references. Thus, the
microprocessor and vector units can execute concurrently so long as no cache misses
occur. 

When a cache block must be fetched from memory, the associated vector unit 
may be in one of three states. If it is not performing any local operations, then 
the cache block is fetched immediately. If it is performing a local load or store
operation, then the block fetch is delayed until the operation completes. If the 
vector unit is doing an operation that does not require the memory bus, then the 
block fetch proceeds immediately, concurrently with the executing vector 
operation. 

The microprocessor issues VU instructions by storing to a specially constructed 
address: the microprocessor fetches the instruction itself from its data memory, 
calculates the special vector-unit destination address for issuing the instruction,
and executes the store. The time it takes the microprocessor to do this is generally
less than the time it takes a vector unit to execute an instruction with a vector
length of 4. Moreover, the tail end of one vector instruction may be overlapped
in time with the beginning of the next, thus eliminating memory latency and
vector start-up overhead. With careful programming, therefore, the microprocessor
can sustain delivery of vector instructions so as to keep the vector units
continuously busy.

The vector unit is optimally suited for a vector length of 8; with vectors this long, 
the timing requirements are not so critical, and the microprocessor has time to 
spare for bookkeeping operations. The short vector-unit instruction format
supports addressing of length-8 register blocks for either 32-bit or 64-bit operands.
This provides 8 vector registers for 64-bit elements or 16 vector registers for 
32-bit elements, with the first two such register blocks also addressable as 16 
scalar registers. This is only a conventional arrangement, however; long-format 
instructions can address the registers in arbitrary patterns. 

Flow control of instructions to the vector units is managed using the hardware 
protocol of the node bus. When a vector instruction is issued by the 
microprocessor, any addressed vector unit may stall the bus if it is busy. A small 
write buffer and independent bus controller within the microprocessor allow it to 
continue local execution of its own instructions while the bus is stalled by a 
vector unit. If the microprocessor gets far enough ahead, the small write buffer 
becomes full, causing the microprocessor to stall until the vector unit(s) catch up. 
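
The stall behavior just described can be modeled as a bounded queue. This is a minimal sketch, assuming a buffer capacity chosen for illustration; the class and method names are hypothetical, and the real buffer depth is not given here.

```python
from collections import deque


class WriteBuffer:
    """Toy model of the microprocessor's small write buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def try_issue(self, instruction):
        """Return False when the buffer is full, i.e. the microprocessor stalls."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(instruction)
        return True

    def drain_one(self):
        """A vector unit accepts the oldest buffered instruction, if any."""
        return self.pending.popleft() if self.pending else None
```

While `try_issue` succeeds, the microprocessor runs ahead of the vector units; once it fails, no further vector instruction can be issued until `drain_one` frees a slot.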

Each vector instruction either completes successfully or terminates in a hard 
error condition. Exceptions and other non-fatal conditions are signaled in sticky 
status registers that may be either polled or enabled to signal interrupts. Hard 
errors and enabled exception conditions are signaled to the microprocessor as 
interrupts via the Network Interface. 

The memory addresses on the node bus are physical addresses resulting from 
memory-map translation in the microprocessor. The memory map provides the 
necessary protection to ensure that the addressed location itself is in fact within 
a user’s permitted address space, but cannot prevent accesses to other locations 
by execution of vector instructions that use indirect addressing or memory 
strides. Additional protection is provided in each vector unit by bounds-checking 
hardware that signals an interrupt if specified physical address bounds are 
exceeded. 

Certain privileged vector unit operations are reserved for supervisor use. These 
include the interrupt management and memory management features. The super¬ 
visor can interrupt a user task at any time for task-switching purposes and can 
save the state of each vector unit for transparent restoration at a later time. 
Chapter 18 


Global Architecture 


A single user process (as shown in Chapter 16) “views” the CM-5 system as a set 
of processing nodes plus a partition manager, with I/O and other extra-partitional 
activities being provided by the operating system. 

Supporting such processes, however, requires that the underlying system soft¬ 
ware make appropriate use of the global architecture provided by the CM-5’s 
communications networks. 

All the computational and I/O components of a CM-5 system interact through two 
networks, the Control Network and the Data Network. Every such component is 
connected through a standard CM-5 Network Interface. The NI presents a simple, 
synchronous 64-bit bus interface to a node or I/O processor, decoupling it both 
logically and electrically from the details of the network implementation. 

The Control Network supports communication patterns that may involve all the 
processors in a single operation; these include broadcasting, reduction, parallel 
prefix, synchronization, and error signaling. The Data Network supports point- 
to-point communications among the processors, with many independent 
messages in transit at once. 


18.1 The Network Interface 

The CM-5 Network Interface provides a memory-mapped control-register inter¬ 
face to a 64-bit processor memory bus. All network operations are initiated by 
writing data to specific addresses in the bus address space. 

Many of the control registers appear to be at more than one location in the 
physical address space. When a control register is accessed, additional information 
is conveyed by the choice of which of its physical addresses was used for the 
access; in other words, information is encoded in the address bits. For example, 
when the Control Network is to be used for a combining operation, the first 
(and perhaps only) bus transaction writes the data to be combined, and 
the choice of address indicates which combining operation is to be used. One of 
the address bits indicates whether the access has supervisor privileges; an error 
is signaled on an attempt to perform a privileged access using an unprivileged 
address. (Normally the operating system maps the unprivileged addresses into 
the address space of the user process, thereby giving the user program 
zero-overhead access to the network hardware while prohibiting user access to 
privileged features.) 
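
The idea of encoding information in address bits can be sketched as follows. The bit positions, base address, and names below are assumptions made for illustration; the actual NI address map is not given in this chapter.

```python
# Hypothetical bit layout; the real NI address map is not given in the text.
NI_COMBINE_BASE = 0x2000_0000   # assumed base address of a combine register
OP_SHIFT = 3                    # assumed position of the operation field
OP_MASK = 0x7
PRIVILEGED_BIT = 1 << 12        # assumed position of the supervisor bit

COMBINE_OPS = ["or", "xor", "max", "add_signed", "add_unsigned"]


def encode_address(op, privileged=False):
    """Build the address whose bits select the combining operation."""
    addr = NI_COMBINE_BASE | (COMBINE_OPS.index(op) << OP_SHIFT)
    if privileged:
        addr |= PRIVILEGED_BIT
    return addr


def decode_address(addr):
    """Recover (operation, privileged) from the address bits alone."""
    op = COMBINE_OPS[(addr >> OP_SHIFT) & OP_MASK]
    return op, bool(addr & PRIVILEGED_BIT)


def check_access(addr, needs_privilege):
    """Signal an error for a privileged access through an unprivileged address."""
    _, privileged = decode_address(addr)
    if needs_privilege and not privileged:
        raise PermissionError("privileged access through an unprivileged address")
```

Because the operation travels in the address, a single store of the data word is enough to start the operation, and mapping only the unprivileged addresses into a user process is enough to enforce protection.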

The logical interface is divided into a number of functional interfaces. See 
Figure 33. Each functional interface presents two FIFO interfaces, one for 
outgoing data and one for incoming data. A processor writes messages to the outgoing 
FIFO and pulls messages from the incoming FIFO, using the same basic protocol 
for each functional interface. Different functional interfaces, however, respond 
in different ways to these messages. For example, a Data Network interface treats 
the first 32 bits of a message as a destination address to which to send the 
remainder of the message; a Control Network combining interface forwards the 
message to be summed (or otherwise combined) with similar messages from all 
the other processors. 

[Diagram labels: to Control Network; to Data Network] 

Figure 33. CM-5 Network Interface. 

The Network Interface contains several functional interfaces. Each is a pair of FIFO buffers at 
particular memory locations. Through memory mapping, the Data Network and Control Network are 
directly accessible by user code. 

Data is kept in each FIFO in 32-bit chunks. The memory-bus interface accepts 
both 32-bit and 64-bit bus transactions. Writing 64 bits thus pushes two 32-bit 
chunks onto an output FIFO; reading 64 bits pulls two chunks from an input FIFO. 

For outgoing data, there are two control registers called send and send_first. 
Writing data to the send_first register initiates an outgoing message; address 
bits encode the intended total length of the message (measured in 32-bit chunks). 
Any additional data for that message is then written to the send register. After 
all the data for that message has been written, the program can test the send_ok 
bit in a third control register. If the bit is 1, then the network has accepted the 
message and bears all further responsibility for handling it. If the bit is 0, then 
the data was not accepted (because the FIFO overflowed) and the entire message 
must be re-pushed into the FIFO at a later time. The send_space control register 
may be checked before starting a message to see whether there is enough space 
in the FIFO to hold the entire message; this should be treated only as a hint, 
however, because supervisor operations (such as task switching) might invalidate it. 
In many situations throughput is improved by pushing without checking first, in 
the expectation that the FIFO will empty out as fast as new data is being pushed. 
It is also permissible to check the send_ok bit before all the data words for the 
message have been pushed; if it is 0, the message may be retried immediately. 
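
The push-then-test protocol can be sketched with a toy model of a send-FIFO. This is only an illustration under stated assumptions: the class, its capacity, the `drain` method, and the `on_full` callback are hypothetical, and the `length` argument stands in for the length field that the hardware takes from address bits.

```python
class SendFifo:
    """Toy model of one NI send-FIFO; capacity is counted in 32-bit chunks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queued = []          # chunks the network has accepted
        self._partial = []        # chunks of the message being pushed
        self._expected = 0
        self.send_ok = 0

    @property
    def send_space(self):
        """Free chunks right now; only a hint, as the text notes."""
        return self.capacity - len(self.queued) - len(self._partial)

    def send_first(self, length, chunk):
        """Start a message of `length` chunks with its first chunk."""
        self._partial, self._expected, self.send_ok = [], length, 1
        self._push(chunk)

    def send(self, chunk):
        self._push(chunk)

    def _push(self, chunk):
        if self.send_ok and self.send_space > 0:
            self._partial.append(chunk & 0xFFFFFFFF)
            if len(self._partial) == self._expected:   # message complete
                self.queued.extend(self._partial)
                self._partial = []
        else:
            self.send_ok = 0      # overflow: the whole message is rejected
            self._partial = []

    def drain(self, n):
        """Model the network removing n chunks from the FIFO."""
        del self.queued[:n]


def send_message(fifo, chunks, on_full, max_attempts=10):
    """Push a message, retrying the whole message whenever send_ok reads 0."""
    for _ in range(max_attempts):
        fifo.send_first(len(chunks), chunks[0])
        for chunk in chunks[1:]:
            fifo.send(chunk)
        if fifo.send_ok:
            return True
        on_full()                 # e.g. service the receive-FIFO, then retry
    return False
```

Note that `send_message` never consults `send_space`; it pushes optimistically and relies on `send_ok`, which is the strategy the text recommends when the FIFO is expected to drain quickly.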

For incoming data, a processor can poll the receive_ok bit until it becomes 1, 
indicating that a message has arrived; alternatively, it can request that certain 
types of messages trigger an interrupt on arrival. In either case, the program can 
then check the receive_length_left field to find out how long the message 
is and then read the appropriate number of data words from the receive control 
register. 
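
The polling side of the protocol can be sketched in the same spirit. The model below is an assumption-laden illustration: `ReceiveFifo`, `deliver`, and `poll_once` are hypothetical names, and only the three register names come from the text.

```python
from collections import deque


class ReceiveFifo:
    """Toy model of one NI receive-FIFO holding whole messages."""

    def __init__(self):
        self._messages = deque()   # messages delivered by the network
        self._current = deque()    # chunks left in the message being read

    def deliver(self, chunks):
        """Network side: a complete message arrives."""
        self._messages.append(deque(chunks))

    def _load(self):
        if not self._current and self._messages:
            self._current = self._messages.popleft()

    @property
    def receive_ok(self):
        self._load()
        return 1 if self._current else 0

    @property
    def receive_length_left(self):
        self._load()
        return len(self._current)

    def receive(self):
        """Read the next 32-bit chunk of the current message."""
        self._load()
        return self._current.popleft()


def poll_once(fifo):
    """Return one complete message, or None if nothing has arrived."""
    if not fifo.receive_ok:
        return None
    return [fifo.receive() for _ in range(fifo.receive_length_left)]
```

A program would call `poll_once` in a loop, interleaved with its own sends, until it has received everything it expects.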

The supervisor can always interrupt a user program and send its own message; 
this is done by deleting any partial user message, sending the supervisor 
message, and then forcing the send_ok bit for that unit to 0 before resuming the user 
program. To the user program it merely appears that the FIFO was temporarily 
full; the user program should then retry sending the message. The supervisor can 
also lock a send-FIFO, in which case it appears always to be full, or disable it, 
in which case user access will cause an interrupt. The supervisor can save and 
transparently restore the contents of any receive-FIFO. 

Each Network Interface records interrupt signals and error conditions generated 
within its associated processor, exchanges error and interrupt information with 
the Control Network, and forwards interrupt and reset signals to its associated 
processor. 


18.2 The Control Network 

Each Network Interface contains an assortment of functional interfaces 
associated with the Control Network. All have the same dual-FIFO organization 
but differ in detailed function. 

Every Control Network operation potentially involves every processor. A pro¬ 
cessor may push a message into one of its functional interfaces at any time; 
shortly after all processors have pushed messages, the result becomes available 
to all processors. Messages of each type may be pipelined; a number of messages 
may be sent before any results are received and removed. (The exact depth of the 
pipeline varies from one functional interface to another.) The general idea is that 
every processor should send the same kinds of messages in the same order. The 
Control Network, however, makes no restrictions about when each processor 
sends or receives messages. In other words, processors need not be exactly syn¬ 
chronized to the Control Network; rather, the Control Network is the very means 
by which processors conduct synchronized communication en masse. 

There are exceptions to the rule that every processor must participate. The func¬ 
tional interfaces contain mode bits for abstaining. A processor may set the 
appropriate mode bit in its Network Interface in order to abstain from a particular 
type of operation; each operation of that type will then proceed without input 
from that processor or without delivering a result to that processor. A 
participating processor is one that is not abstaining from a particular kind of Control 
Network operation. 


Broadcasting 

The broadcast interface handles broadcasting operations. There are actually 
three distinct broadcasting interfaces: one for user broadcast, one for supervisor 
broadcast, and one for interrupt broadcast. Access to the supervisor broadcast 
interface or interrupt broadcast interface is a privileged operation. 
Only one processor may broadcast at a time. If another processor attempts to 
send a broadcast message before completion of a previous broadcast operation, 
the Control Network signals an error. 

A broadcast message is one to fifteen 32-bit words long. Shortly after a message 
is pushed into the broadcast send-FIFO, copies of the message are delivered to 
all participating processors. The user broadcast and supervisor broadcast inter¬ 
faces are identical in function except that the latter is reserved for supervisor use. 

An interrupt broadcast message causes every processor to receive an interrupt or 
reset signal. A processor can abstain from receiving interrupts, in which case it 
ignores interrupt messages when it receives them; but a processor cannot abstain 
from a reset signal (which causes the receiving NI and its associated processor 
to be reset). 

As an example of the use of broadcast interrupts, consider a partition manager 
coordinating the task-switching of user processes. When it is time to switch 
tasks, the PM uses the Control Network to send a broadcast interrupt to all nodes 
in the partition. This transfers control in each node to supervisor code, which can 
then read additional supervisor broadcast information about the task-switch 
operation (such as which task is up next). 


Combining 

The combine interface handles reduction and parallel prefix operations. A com¬ 
bine message is 32 to 128 bits long and is treated as a single integer value. There 
are four possible message types: reduction, parallel prefix, parallel suffix, and 
router-done. The router-done operation is simply a specialized logical OR reduc¬ 
tion that assists the processors in a protocol to determine whether Data Network 
communications are complete. Reduction, parallel prefix, and parallel suffix may 
combine the messages using any one of five operators: bitwise logical OR, bit¬ 
wise logical XOR, signed maximum, signed integer addition, and unsigned 
integer addition. (The only difference between signed and unsigned addition is 
in the reporting of overflow.) The message type and desired combining operation 
are encoded by address bits when writing the destination address to the 
send_first register. 

As an example, every processor might write a 64-bit integer to the combine inter¬ 
face, specifying signed integer addition reduction. Shortly after the last 
participating processors write their input values, the signed sum is delivered to 
every participating processor, along with an indication of whether overflow 
occurred at any intermediate step. 

As another example, every processor might write a 32-bit integer to the combine 
interface, specifying signed maximum parallel prefix. Shortly after the last par¬ 
ticipating processors write their input values, every participating processor 
receives the largest among the values provided by itself and all lower-numbered 
processors. 
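
The two examples above can be expressed in software as follows. This is a sketch of the results only, not of the network's combining tree: it assumes a simple left-to-right combining order, so the step at which overflow is detected may differ from the hardware. The function names are hypothetical.

```python
from itertools import accumulate


def signed_add_reduce(values, bits=64):
    """Sum with wraparound; report overflow at any step (left-to-right order)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    total, overflow = 0, False
    for v in values:
        total += v
        if not lo <= total <= hi:
            overflow = True
            total = (total - lo) % (1 << bits) + lo   # wrap into range
    return total, overflow


def max_prefix(values):
    """Element i is the maximum of values[0..i], inclusive."""
    return list(accumulate(values, max))
```

Here `values[i]` plays the role of the input written by processor i. Every participant receives the same pair from `signed_add_reduce`, whereas processor i receives element i of `max_prefix`.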

The combine interface also supports segmented parallel prefix (and parallel 
suffix) operations. Each combine interface contains a scan_start flag; when this 
flag is 1, that NI is considered to begin a new segment for purposes of parallel 
prefix operations. 
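
A segmented parallel prefix can be sketched as an ordinary prefix that restarts at each segment boundary. The function below is an illustration with hypothetical names; it assumes an inclusive prefix, in line with the signed-maximum example, and takes one flag per processor in place of the per-NI flag.

```python
import operator


def segmented_prefix(values, scan_start, op=operator.add):
    """Inclusive prefix that restarts wherever scan_start[i] is 1."""
    result = []
    for i, (value, start) in enumerate(zip(values, scan_start)):
        if i == 0 or start:
            result.append(value)              # first element of a new segment
        else:
            result.append(op(result[-1], value))
    return result
```

For instance, with addition, the inputs 1, 2, 3, 4, 5 and flags 1, 0, 1, 0, 0 form two segments, (1, 2) and (3, 4, 5), and the running sums start over at the third processor.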

Every participating processor must specify the same message type and combining 
operation. If, in the course of processing combine requests in order, the 
Control Network encounters different combine requests at the same time, it 
signals an error. 


Global Operations 

Global bit operations produce the logical OR of one bit from every participating 
processor. There are three independent global operation interfaces, one synchro¬ 
nous and two asynchronous, which may be used completely independently of 
each other and of other Control Network functions. This makes them useful for 
signaling conditions and exceptions. 

The synchronous global interface is similar to the combine interface except that 
the operation is always a logical OR reduction and each message consists of a 
single bit. Processors may provide their values at any time; shortly after the last 
participating processors have written their input bits, the logical OR is delivered 
as a single-bit message to every participating processor. 

Each asynchronous global interface produces a new value any time the value of 
any input is changed. Input values are continually transported, combined, and 
delivered throughout the Control Network without waiting for all processors to 
participate. Processors may alter their input bits at any time. These interfaces are 
best used to detect the transition from 0 to 1 in any processor or to detect the 
transition from 1 to 0 in all processors. (The NI signals an interrupt, if enabled, 
whenever a transition from 0 to 1 is observed.) 
There are two asynchronous global interfaces, one for the user and one for the 
supervisor. Access to the supervisor asynchronous global interface is a privileged 
operation. 


Synchronization 

Both the synchronous global interface and the combine interface may be used to 
implement barrier synchronization: if every processor writes a message and then 
waits for the result, no processor passes the barrier until every processor has 
reached the barrier. The hardware implementation of this function provides 
extremely rapid synchronization of thousands of processors at once. Note that the 
router-done combine operation is designed specifically to support barrier syn¬ 
chronization during a Data Network operation, so that no processor abandons its 
effort to receive messages until all processors have indicated that they are no 
longer sending messages. 
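
The barrier property follows from the fact that no result exists until the last participant has written. The toy model below illustrates this for the synchronous global interface; the class and method names are hypothetical, and abstaining processors are simply left out of the participant set.

```python
class SynchronousGlobal:
    """Toy model of a synchronous global OR used as a barrier."""

    def __init__(self, n_processors, abstaining=()):
        self.participants = set(range(n_processors)) - set(abstaining)
        self.inputs = {}

    def write(self, processor, bit):
        """A processor contributes its bit on reaching the barrier."""
        if processor in self.participants:
            self.inputs[processor] = bit & 1

    def result(self):
        """None until every participant has written; then the logical OR."""
        if set(self.inputs) != self.participants:
            return None
        return int(any(self.inputs.values()))
```

A processor that has written its bit and then polls `result` cannot proceed until every other participant has also written, which is exactly the barrier condition.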


Flushing the Control Network 

There is a special functional interface for clearing the intermediate state of com¬ 
bine messages, which may be required if an error or task switch occurs in the 
middle of a combine operation. A flush message behaves very much like a broad¬ 
cast message: shortly after one processor has sent such a message, all processors 
are notified that the flush operation has completed. Access to the flush functional 
interface is a privileged operation. 


Error Handling 

The Control Network is responsible for detecting certain kinds of communica¬ 
tions errors, such as an attempt to specify different combining operations at the 
same time. More important, it is responsible for distributing error signals 
throughout the system. Hard error signals are collected from the Data Network 
and all Network Interfaces; these error signals are combined by logical OR opera¬ 
tions and the result is redistributed to every Network Interface. 
18.3 The Data Network 

Each Network Interface contains one Data Network functional interface. The 
first 32-bit chunk of a message is treated as a destination address; it must be fol¬ 
lowed by one to five additional 32-bit chunks of data. This data is sent through 
the Data Network and delivered to the receive-FIFO of the Network Interface at 
the specified destination address. Each message also bears a 4-bit tag, which is 
encoded by address bits when writing the destination address to the 
send_first register. The tag provides a cheap way to differentiate among a 
small number of message types. The supervisor can reserve certain tags for its 
own use; any attempt by the user to send a message with a reserved tag signals 
an error. The supervisor also controls a 16-bit interrupt mask register; when a 
message arrives, an interrupt is signaled to the destination processor if the mask 
bit corresponding to the message’s tag value is 1. 
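
The tag checks amount to a set-membership test and a bit test. The sketch below uses hypothetical function names and assumes that bit t of the 16-bit mask corresponds to tag value t.

```python
NUM_TAGS = 16  # a 4-bit tag gives 16 possible values


def check_user_tag(tag, reserved_tags):
    """Reject tags that do not fit in 4 bits or that the supervisor has reserved."""
    if not 0 <= tag < NUM_TAGS:
        raise ValueError("tag must fit in 4 bits")
    if tag in reserved_tags:
        raise PermissionError(f"tag {tag} is reserved for the supervisor")
    return tag


def interrupt_on_arrival(tag, interrupt_mask):
    """True if the mask bit for this tag is 1 (assumed: bit t <-> tag t)."""
    return bool((interrupt_mask >> tag) & 1)
```

With a mask of 0b0101, for example, messages tagged 0 or 2 would raise an interrupt on arrival and all others would wait to be polled.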

A destination address may be physical or relative. A physical address specifies 
a particular Network Interface that may be anywhere in the system and is not 
checked for validity. Using a physical address is a privileged operation. A rela¬ 
tive address is bounds-checked, passed through a translation table, and added to 
a base register. A relative destination address is thus very much like a virtual 
memory address: it provides to a user program the illusion of a contiguous 
address space for the nodes running from 0 to one less than the number of pro¬ 
cessing elements. Access to the bounds register, translation table, or base register 
is a privileged operation; thus the supervisor can confine user messages within 
a partition. 
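
The three translation steps can be written out directly. This is a minimal sketch, assuming one translation-table entry per relative address; the real table granularity is not specified here, and the function and parameter names are hypothetical.

```python
def translate_relative(relative, bounds, table, base):
    """Bounds-check, look up in the translation table, then add the base."""
    if not 0 <= relative < bounds:
        raise ValueError("relative address outside the partition")
    return base + table[relative]
```

With `base=32` and an identity table, relative address 2 maps to physical address 34. Changing a single table entry lets the supervisor redirect one relative address to a different node without the user program noticing.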

While programs may use an interrupt protocol to process received messages, data 
parallel programs usually use a receiver-polls protocol in which all processors 
participate. In the general case, each processor has some number of messages to 
send (possibly none). Each processor alternates between pushing outgoing mes¬ 
sages onto its Data Network send-FIFO and checking its Data Network 
receive-FIFO. If any attempt to send a message fails, that processor should then 
check the receive-FIFO for incoming messages. Once a processor has sent all its 
outgoing messages, it uses the Control Network combine interface to assert this 
fact; it then alternates between receiving incoming messages and checking the 
Control Network. When all processors have asserted that they are done sending 
messages and all outstanding messages have been received, the Control Network 
asserts the router-done signal to indicate to all the processors that the commu¬ 
nications step is complete and they may proceed. 
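
The receiver-polls protocol can be simulated with a round-robin loop over the processors. This is a sketch under simplifying assumptions: a send fails only when the destination's receive-FIFO is full, the network itself holds no messages, and the router-done condition is evaluated by the simulator, not by a combine operation. All names are hypothetical.

```python
from collections import deque


def receiver_polls(outgoing, fifo_capacity=2):
    """Simulate one communication step.

    outgoing[p] lists (destination, payload) pairs for processor p.
    Returns received[p], the (source, payload) pairs delivered to p.
    """
    n = len(outgoing)
    pending = [deque(msgs) for msgs in outgoing]
    inbox = [deque() for _ in range(n)]       # stands in for each receive-FIFO
    received = [[] for _ in range(n)]

    while True:
        for p in range(n):
            # Try to send one message; a full destination FIFO models a failed send.
            if pending[p]:
                dest, payload = pending[p][0]
                if len(inbox[dest]) < fifo_capacity:
                    inbox[dest].append((p, payload))
                    pending[p].popleft()
            # Whether or not the send succeeded, check for incoming messages.
            if inbox[p]:
                received[p].append(inbox[p].popleft())

        done_sending = all(not q for q in pending)     # all have asserted "done"
        none_outstanding = all(not q for q in inbox)   # all messages received
        if done_sending and none_outstanding:          # the router-done condition
            return received
```

Because every processor keeps receiving even when its own sends fail, a crowded destination drains steadily and the step always terminates.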

For task-switching purposes, the supervisor can put the Data Network into All 
Fall Down (AFD) mode. Instead of trying to route messages to their intended 
destinations, the Data Network drops each one into the nearest node. The 
advantage of this strategy is that no node receives more than a few hundred bytes of 
AFD messages, even if they were all originally intended for a single destination. 
The supervisor can then read them from the Data Network receive-FIFO and save 
them in memory as part of the user task state, re-sending them when that user 
task is resumed. 
Chapter 19 


System Architecture and Administration 


The CM-5 system architecture provides for multiple task execution partitions, I/O 
devices, and fault detection and recovery. It supports a centralized system 
administration facility that gives the administrator flexibility to optimize the use of 
system resources. All these tasks are handled through various extended 
capabilities and privileged features of the Control Network and Data Network, with the 
assistance of a third network, the Diagnostic Network. 


19.1 The System Console 

Administration is managed from a System Console, a process executing on a 
control processor that has a Diagnostics Network interface. Large CM-5 systems 
will typically have a dedicated processor for administration; on small CM-5 sys¬ 
tems, the administration process may run on a control processor that also has 
other tasks. 

The System Console processor has a Diagnostics Network connection that 
allows it to address the entire system. It is responsible for configuring the system 
on power-up, for partitioning the system, and for managing the system as it 
changes due to repartitioning and hardware failures. A database containing the 
status of the overall system, kept up to date by the Diagnostics Network, helps 
it perform these tasks. 
19.2 Allocation of Resources 

The CM-5 system provides flexible allocation of computational resources. The 
administrator can subset processing nodes into partitions; the administrator can 
also allocate control processors to single or multiple I/O devices. 


Partitions 

The set of computational and network resources in use at any given instant by 
a single user task is called a partition. Each partition constitutes a complete task 
execution system that may be used for timesharing, batch processing, or both. 

The system administrator creates partitions dynamically, to best accommodate 
the site’s workload. Some administrators may use a partitioning strategy that 
involves changing the partitioning two or three times during the course of a day. 
Other sites may stick with a single set of partitions for several days at a time. 

An administrator might, for example, create three partitions on a system: one 
dedicated to a production run of a single large application, a second one used for 
timeshared program development by day and scheduled batch processing by 
night, and a third small one dedicated to around-the-clock timeshared access. 

All partitions are joined by the Control Network and Data Network in a single 
integrated system. Resources can therefore be reallocated from one partition to 
another when necessary. For example, all partitions might be joined to form one 
giant partition in order to tackle a single giant application. As another example, 
if processors were to fail in the partition dedicated to a production run, they could 
be replaced (by reconfiguring the networks) with processors borrowed from 
another partition. The production run could then be rolled back to a prior check¬ 
point and resumed with minimal disruption, while the failed processors were 
powered down and, at a convenient time, physically replaced. 


I/O 

I/O devices and interfaces, like processing nodes, reside in specific areas of the 
network address space and are managed by control processors. The I/O resources 
they control are available to processes running on any partition. The Data 
Network transfers data between I/O devices and partitions, while the Control 
Network is used by the operating system to monitor the transfers and signal 
errors. 


19.3 Partitions and Networks 

From a system view, the Control Network and Data Network are designed to 
provide 

■ the capability for flexible partitioning of computing resources 

■ the isolation of each partition’s network activity 

■ high throughput for all cases of data transfer 

To see how this works, we look at the way in which the address space on these 
two networks is managed. 

Figure 34 shows a simplified view of address space management in the net¬ 
works. As this figure suggests, each of the superficially homogeneous networks 
is logically split by hardware-supported, software-configured mechanisms so as 
to devote a portion to each partition or I/O resource. Additional network capacity 
is dedicated to carrying traffic between the various partitions and devices that 
make up the system at any given time. Network resources allotted to one partition 
do not overlap those associated with another. Moreover, traffic from one 
partition to another, or between a partition and an I/O device, consumes no net¬ 
work resources belonging to any intervening partition. The network design thus 
guarantees that network traffic within one partition cannot affect the behavior or 
the performance of traffic in another partition. (The only exception occurs when 
processors fail and are logically replaced for the nonce by more distant proces¬ 
sors from another partition.) The design also allows the Data Network to 
guarantee each processing node at least 5 Mbytes/sec of I/O bandwidth, no matter 
where it is in the network. However the nodes are divided into partitions, there 
is always enough Data Network to serve each partition and enough left over to 
guarantee the stated I/O rate. 

When a CM-5 system is first powered up, reset, and bootstrapped, the networks 
form a single partition that spans the entire system. The operating system then 
creates a temporary partition for initializing the nodes. It also initializes the I/O 
devices. After the startup procedures have been completed, administration soft¬ 
ware establishes one or more operating partitions. 
Within each partition, the Network Interfaces are assigned virtual network 
addresses starting at zero. User programs use virtual network addresses; they are 
translated by hardware into physical network addresses wherever necessary, in 
exactly the same way that a memory management unit translates virtual memory 
addresses to physical memory addresses. Therefore, a user program need not 
concern itself with the physical network addresses of the partition being used to 
execute it. 



[Diagram labels: VME; HIPPI] 


Figure 34. Network support for multiple partitions. 

The processing nodes of a CM-5 system can be configured into two or more partitions. Each partition 
is assigned to a partition manager, a control processor that bears the responsibility for managing the 
processes executing in that partition. The operating system configures the Control Network and Data 
Network to match the partition structure. Each partition has a dedicated portion of each network 
sufficient to provide that partition with the guaranteed minimum network bandwidth of 5 Mbytes/sec 
for the Control Network and 5 Mbytes/sec per node for the Data Network, regardless of destination. 
No matter how the partitions are configured, there is always additional network capacity for carrying 
data between partitions and I/O devices or from one partition to another. Therefore, system-wide data 
traffic does not interfere with or impede traffic that stays within a partition. 




November 1993 

Copyright © 1993 Thinking Machines Corporation 












The translation of virtual network addresses includes protection checking that 
prevents a user process from sending messages to destinations outside its parti¬ 
tion. The supervisor can send messages from one partition to another; the 
mechanism is identical except that it is not subject to the same protection checks 
because for this purpose the supervisor uses absolute physical network addresses. 

I/O is coordinated by the operating system. User processes may transfer data to 
and from I/O devices, or to and from other user processes (through the facility 
of UNIX named pipes). In both cases, the operating system breaks up the data 
into messages and sends the messages through the Data Network. If the two user 
processes happen to be in the same partition, the message traffic is confined to 
that partition, not because of protection (the supervisor is responsible for sending 
the messages in this case) but simply as a consequence of the structure of the 
Data Network. 


19.4 Network Implementation 

The topology of the CM-5 Data Network is a fat tree, so called because some 
branches are “fatter” (of higher bandwidth) than others. See Figure 35. This kind 
of tree is actually more like a biological tree than the computer scientist’s usual 
notion of a tree. A biological tree has skinny twigs, but the limbs are merely 
slender, the branches are stout, and the trunk is truly fat. Even though there are 
a thousand twigs and only one trunk, the trunk still has the bandwidth to carry 
sap for all the twigs because it is fat. (Maybe a computer scientist’s tree really 
ought to be called a “skinny tree” — but of course it’s too late now to change the 
terminology. Computer scientists are also in the habit of drawing trees upside 
down, with the root at the top and the leaves at the bottom, and we follow this 
unnatural convention in drawing fat trees here.) 

The fat tree structure has a number of distinct advantages over two other topolo¬ 
gies, the hypercube and the 2D mesh, used in many other parallel computer 
systems (including the Connection Machine models CM-1 and CM-2). Like the 
mesh and hypercube, it can be divided into smaller pieces of the same topology: 
a mesh can be carved up into smaller meshes; the two halves of a hypercube are 
themselves hypercubes; and the subtrees of a fat tree are themselves fat trees. 
This allows the processors to be partitioned in a way that naturally partitions the 
network, so that each group of processors has its own dedicated portion of the 
network. Traffic among the processors in one partition does not compete for 
bandwidth with traffic within another partition. 
The fat tree, however, has an additional property not shared by the other two: 
traffic between two partitions does not interfere with traffic internal to a third 
partition. Consider I/O traffic, for example, between partition 1 and the I/O 
devices in Figure 35. The I/O messages will travel through the upper part of the 
fat tree, passing over partition 2 rather than through it, as might happen in a mesh 
or hypercube topology. 

Figure 35 shows only an abstract binary fat tree. The CM-5 Data Network is 
actually a 4-ary fat tree, where each node has four children. The size of a CM-5 Data 
Network is often described by its height, which is the base-4 logarithm of the 
number of network addresses spanned. (Put another way, the height of the 
network equals one-half the number of bits in a processor address.) A CM-5 Scale 
3 system, for example, contains a height-3 fat tree, which spans 4^3 = 64 network 
addresses, enough for 32 processing nodes plus control processors and I/O. 
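
The relationship between height, address count, and address width is simple enough to check in a few lines. The function names below are hypothetical and serve only to restate the arithmetic.

```python
def addresses_spanned(height):
    """Network addresses spanned by a 4-ary fat tree of the given height."""
    return 4 ** height


def address_bits(height):
    """Each level of a 4-ary tree contributes two address bits."""
    return 2 * height


def height_for(addresses):
    """Smallest height whose tree spans at least this many addresses."""
    height = 0
    while 4 ** height < addresses:
        height += 1
    return height
```

For the height-3 example, `addresses_spanned(3)` is 64 and `address_bits(3)` is 6, and indeed 2 to the 6th power is 64.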

Figure 36. CM-5 fat tree.

Each internal node of the fat tree is implemented as a set of Data Network
switches, each a separate VLSI chip. The number of switches per node depends
on where it is in the tree; the closer to the root, the fewer nodes and the more
switches per node. Each switch has four children and either two or four parents.
See Figure 36, which illustrates a fat tree with 16 leaves. Each leaf represents one
Network Interface. The level-1 nodes have two switches each; the level-2 nodes
have four switches each. As shown in the figure, each level-1 or level-2 switch
has two parents. Switches at higher levels have four parents each; at every level
above level 2, there are four links going up for every four links coming in from
below, thus maintaining constant bandwidth per Network Interface no matter
how large the network grows.
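
The link counts above can be tabulated with a short Python sketch. It starts from
the two switches per level-1 node given in the text and assumes that every upward
link from a child node lands on one of the four child ports of a switch in the
parent node. Switch counts above level 2 are inferred from that assumption, not
quoted from the text, and the function name is hypothetical.

```python
def fat_tree_levels(height: int):
    """Return (level, switches per node, up links per node, up links per leaf)."""
    rows = []
    switches = 2  # level-1 nodes have two switches each (from the text)
    for level in range(1, height + 1):
        parents = 2 if level <= 2 else 4   # parents per switch (from the text)
        up_links = switches * parents
        leaves = 4 ** level                # Network Interfaces below this node
        rows.append((level, switches, up_links, up_links / leaves))
        # Four child nodes send 4 * up_links links upward; each parent switch
        # accepts four of them, so the parent node holds up_links switches.
        switches = up_links
    return rows


for row in fat_tree_levels(4):
    print(row)
```

Under these assumptions the last column settles at the same value for every level
from 2 upward, which is the constant bandwidth per Network Interface that the
text describes.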

The routing algorithm for the Data Network is very simple. The Network Interface
compares the physical destination address to its own and determines how far
up the tree the message must travel. The message can then take any path up the
tree. This allows the switches to perform load balancing on the fly. Once the
message has reached the necessary height in the tree, it must then follow a particular
path down — but the path has the same description no matter what upward path 
was taken. For example, if a message that has reached height 2 is destined for 
the tenth processor in Figure 36, then no matter which of the four height-2 
switches it has reached, it must travel to the third child (which might be either 
of two height-1 switches) and then to that child’s second child. 
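
The routing rule just described can be sketched in Python. The sketch assumes
that leaves are numbered from zero, left to right, and that each base-4 digit of
an address selects one of four children; the text does not spell out the address
encoding, so both function names and the encoding are illustrative assumptions.

```python
def climb_height(src: int, dst: int) -> int:
    """Levels a message must rise: the lowest subtree that holds both leaves."""
    height = 0
    while src != dst:
        src >>= 2   # drop one base-4 digit (one tree level) from each address
        dst >>= 2
        height += 1
    return height


def down_path(dst: int, height: int) -> list[int]:
    """Child index (0-3) to take at each switch on the way down, top level first."""
    return [(dst >> (2 * level)) & 3 for level in range(height - 1, -1, -1)]


# The tenth processor has zero-based address 9, which is 21 in base 4.
print(climb_height(0, 9))   # 2
print(down_path(9, 2))      # [2, 1]: third child, then that child's second child
```

The downward path depends only on the destination address, which is why any of
the upward paths may be chosen freely.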

Figure 37 is a simplification of Figure 36, showing only the links connecting the 
switches (and the 16 leaves). Figure 38 then shows how 64 nodes are connected 
by taking four copies of the 16-node network and adding switches at height 3. 
Then 256 nodes may be connected by using switches at height 4 to connect four
copies of the 64-node network, and so on; see Figure 39, which shows a few 
height-4 switches and a few wires going up toward height 5. 


November 1993 

Copyright © 1993 Thinking Machines Corporation 








Figure 37. Data Network with 16 nodes. 

The CM-5 Data Network has one other good property in addition to scalability,
partitionability, and non-interference: it can continue to operate even when one
or more switches have failed. See Figure 40. If a link breaks or a switch fails,
then once the problem has been diagnosed the Diagnostic Network can configure
neighboring Data Network chips to ignore the failed components. The switch
indicated by the arrow will then have one parent instead of two, so traffic through
that switch will see only half the usual bandwidth. However, load balancing at
lower levels will react by sending proportionately less traffic through that switch,
so that messages soon share seven upgoing wires equally where before they had
eight. The result is that the fat tree continues to function with slightly reduced
bandwidth.

Figure 38. Data Network with 64 nodes.

A mesh with a switch missing is a mesh with a hole in it. In principle traffic could
be routed around the missing switch, but the usual routing algorithms for a complete
mesh are insufficient; moreover, oblivious routing algorithms, which use
fixed routing paths rather than reacting on the fly to dynamic network load or
configuration, simply haven’t a chance. Similarly, a hypercube with a broken
switch is a hypercube with a corner missing, and the usual hypercube routing
algorithms break down. But a fat tree with missing switches is still a fat tree
(it just happens to be a little skinnier in some places), so the standard message
routing strategy continues to work.

Figure 39. Data Network with 256 nodes.

Figure 40. Data Network redundancy.

The Control Network may be thought of as a binary “skinny tree,” at least as far
as a single user program is concerned. (It is actually a little bit fat, which provides
the switching capability necessary to allow any control processor to
manage any partition and provides the same ability to function in the face of
hardware failures as that of the Data Network.) Each switch in the tree has a
controller, connections to two children and one or two parents, and a small integer
arithmetic processor.

Every Control Network operation sends information up the tree to the root 
(remember, computer scientists draw trees upside down) and then back down 
from the root to the leaves. There are separate links going up and down, so the 
Control Network can continuously pipeline information up and down the tree. 

To broadcast a message, the Control Network conveys the data from a single 
Network Interface up the tree to the root; as the data travels back down the tree, 
it is copied to both children at every switch. 

To reduce a set of numbers, each switch waits until both its children have provided
inputs; the switch then sends the combined result to its parent. When the
root has computed the final combined result, it is then broadcast back to the 
leaves. Parallel prefix operations are similar but more complicated, with different 
results traveling back down the tree to different leaves. 
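
The reduction and parallel prefix operations described above can be modeled in
software as an upward sweep followed by a downward sweep over a binary tree. The
Python sketch below is only a model of the data flow, not of the Control Network
hardware; it assumes the number of leaves is a power of two, and the function
names are hypothetical.

```python
from operator import add


def tree_reduce(leaves, op=add):
    """Combine values pairwise on the way up, then broadcast the root's result."""
    level = list(leaves)
    while len(level) > 1:
        # Each switch combines the inputs of its two children.
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return [level[0]] * len(leaves)   # every leaf receives the same result


def tree_scan(leaves, op=add, identity=0):
    """Inclusive parallel prefix: different results travel down to different leaves."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:        # upward sweep, remembering partial results
        prev = levels[-1]
        levels.append([op(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    prefixes = [identity]             # the root has nothing to its left
    for lvl in reversed(levels[:-1]): # downward sweep
        nxt = []
        for i, p in enumerate(prefixes):
            nxt.append(p)                   # left child inherits the prefix
            nxt.append(op(p, lvl[2 * i]))   # right child adds the left subtree
        prefixes = nxt
    return [op(p, v) for p, v in zip(prefixes, leaves)]


print(tree_reduce([3, 1, 4, 1]))   # [9, 9, 9, 9]
print(tree_scan([3, 1, 4, 1]))     # [3, 4, 8, 9]
```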

The Diagnostic Network is also tree-shaped, with the ability to broadcast diagnostic
commands throughout the system and to combine the results of diagnostic
tests. Switching at each level of the tree allows selection of arbitrary subsets of 
chips for diagnosis in parallel. The chips that implement the Diagnostic Network 
are themselves subject to diagnostic commands; chips lower in the tree may be 
tested by chips higher in the tree. 


19.5 Resource Allocation and Management 

CM-5 administration uses standard UNIX mechanisms to control the usage of various
resources (disk usage, CPU usage, memory usage, and so on). These are
enhanced for the CM-5 when necessary: for instance, stack and heap management 
can be set for the nodes in a partition as well as for the control processors. 

Similarly, standard UNIX procedures govern the mounting and maintenance of 
file systems. 


19.6 Accounting, Monitoring, and Error Reporting

Standard UNIX kernel and device drivers collect information on system activity. 
Accounting information is collected by ordinary UNIX tools, including NQS, and 
is logged to a central facility on the System Console. 

Errors occurring during normal operation of the CM system are detected by the 
operating system, collected and distributed by the Control Network. 

Hard error signals are collected from the Data Network and from every Network 
Interface. These signals are combined and distributed according to the current 
partitioning. Errors detected within a partition are signaled to every Network 
Interface in that partition, and are reported if appropriate to the user process 
running at the time of the error. Errors detected in portions of the network outside 
any partition may be optionally signaled into any designated partition. 

The operating system notifies the system administrator of errors by sending a 
message to the System Console processor. It also logs error information in a central
system error log, from which it is available both to the administrator and to
diagnostic utilities. System failures and transient hardware errors are also logged 
to central logging facilities on the System Console. 


19.7 Physical Monitoring Systems 

The CM-5 system includes extensive power and temperature monitoring systems, 
designed for early detection of problems that might cause physical damage to the 
system. The monitoring system reports electronic danger signals, such as detection
of an overheating cabinet, to the System Console.


19.8 Fault Detection and Recovery 

The CM-5 system is designed to provide high system availability. An important 
aspect of this design is rapid diagnosis and smooth degradation in the face of 
component failures. An integrated part of the administration system, the CM-5 
diagnostic system is notable for its completeness, its speed, and the high degree 
of fault isolation it provides. If a failure should occur in a running partition, the 
administrator can interrogate all items in parallel, isolate the failing item, repartition
around the failure, and have the partition up and running again quickly.

In addition, the CM-5 provides hardware and software support for checkpointing,
either at specified time intervals or by explicit program request. The goal is to 
allow user applications to be restarted with full system capabilities, even in the 
presence of failed components. 


The Diagnostic Network 

The Diagnostic Network, which can probe and control the rest of the system, 
handles diagnostics. This network is designed to be simple and reliable. It is not 
particularly fast compared to the Control Network or Data Network, but testing 
and diagnostic procedures are nevertheless speedy because the Diagnostic Network
can operate on all parts of the system in parallel.


The Diagnostic Library 

The CM-5 diagnostic library includes a wide variety of tests. Particularly 
noteworthy among these are the JTAG diagnostics. Based on the IEEE Standard 
1149.1, these scan-based vectors both test chips with a very high level of fault 
coverage and provide connectivity tests between chips (known as boundary scan 
checking). JTAG diagnostics exist for all CM-5 components; they provide 
extremely precise isolation of faults. This precision, in turn, allows rapid 
identification and replacement of failed components, and provides the data 
necessary for the administration database to exclude failed components when 
configuring partitions. 

Diagnostics are run by partition: thus, one partition can be running diagnostics
while others are running user programs. Within the partition, the administrator
can choose to run diagnostics on

■ the entire partition

■ a single subsystem, such as the Control Network

■ a single type of component, such as the nodes

Parallel processing provides speed. In the last example, all nodes are tested in 
parallel and report their status in parallel. In the first example, diagnostic tests 
on all components of the partition are run in parallel. The Diagnostic Network
can address test vectors to the entire system or to any subsystem, such as a
backplane, that is believed to be broken. The status of multiple chips or boards
of the same type is read out in parallel, and components whose values differ from
an expected value are quickly isolated. 


Diagnostics and Components 

All CM-5 components are designed to be testable when in place in the system. 
Nearly all data paths are protected by parity or full CRC. All dynamic memory 
is protected by full ECC that corrects single-bit errors and detects double-bit 
errors and DRAM chip failures. Transfers through the Control Network and Data 
Network are checked by hardware, not merely end-to-end but on every link, so 
that network component failures can be located precisely. 

Failed components can be logically and electrically isolated from the rest of the
system under control of the Diagnostic Network. Surrounding components are
instructed to ignore any and all signals from failed components. The failed section
of the system can then independently execute diagnostic tests or be powered
down for repair or replacement, while the rest of the system continues normal 
operation. 

All major CM-5 system components use either redundant or spare component 
schemes. If a processing node fails, its local group of nodes is taken out of service
and can be logically replaced by any other such group from anywhere in the
system. All control processors are logically interchangeable; any control processor
can manage any partition.

If a Control Network component fails, the consequences depend on the location
of the failure within the network. It may be necessary to give up the use of 1/64
of the Network Interfaces in that partition and whatever they are connected to.
In this case, spare processors may be logically mapped in to replace them. In
other cases, the failure implies the loss of one partition. For example, if a CM-5
system supports up to 8 different partitions, then a Control Network failure might
reduce the maximum number of partitions to 7, but the processing resources
in the failed partition could be reallocated to other partitions.

If a Data Network component fails, the consequences similarly depend on the 
location of the failure. It may be necessary to give up the use of 1/64 of the Network
Interfaces in that partition and whatever they are connected to. In other
cases, no Network Interface need be abandoned; the total global bandwidth of the 
Data Network is diminished, but never by more than about 6 percent for each 
failure. 

I/O devices are also designed to tolerate failures; disk arrays, for example, are
designed to tolerate the failure of one or more disk units without loss of data. See
the descriptions of individual I/O devices for details.


The CM-5 system achieves true I/O scalability by connecting I/O devices directly
to the CM-5 Data Network, achieving three important goals of the I/O system: 

■ Any I/O device, or collection of I/O devices, has the bandwidth needed to 
achieve high throughput. If a device needs more bandwidth than one Data 
Network connection can supply, it is given multiple connections.

■ A system’s I/O storage capacity is independent of the system’s computational
capacity: I/O capacity is expanded merely by adding Data Network
connections. Since the bandwidth of the CM-5 Data Network grows linearly
with the number of connections, the performance of the Data
Network expands to meet the needs of the additional I/O devices.

■ I/O resources are shared equally. Any partition or any computer connected 
to the CM-5 by a LAN may access all CM-5 I/O devices. I/O devices may 
also communicate directly with one another, facilitating, for example, 
direct disk-to-tape copies without the use of partition resources. In addition,
the Data Network allows multiple partitions to perform I/O
operations simultaneously, without affecting each other’s performance or 
data integrity. 

In addition to scalable bandwidth, the CM-5 I/O system supports industry software
standards, such as UNIX file system access and TCP/IP networking with
other machines. The CM-5 operating system (CMOST) provides straightforward,
consistent methods for performing I/O and accommodates a wide range of I/O
devices, such as the Scalable Disk Array, Integrated Tape System, HIPPI and
FDDI network connections, and Thinking Machines’ CMIO bus devices. These
CMIO bus devices, such as the DataVault mass storage system, are supported
both on the CM-5 and on the CM-2 and CM-200. They thus allow multiple Connection
Machine systems to share I/O devices and to have direct access to the
data in those devices.

To a user application, the CM-5 I/O system consists of a collection of virtual I/O
devices, any of which can be home to a file system. File formatting and physical
characteristics of the devices are invisible to the application code. Implementation
details for each I/O device are hidden by a combination of hardware and
software interface modules.


20.1 I/O Architecture 

Every I/O device is connected to the CM-5 through the Data Network, with each 
I/O interface occupying a block of Data Network address space. By convention, 
all I/O devices occupy the upper region of that address space. (See Figure 41.) 
An I/O interface attaches to the Data Network through one or more Network 
Interfaces — the same type of interface that connects processing nodes to the 
Data Network. 

An important consideration in any I/O scheme is matching the system’s internal 
bandwidth to the I/O rates of peripheral devices attached to the system. Here, the 
Data Network’s intrinsic scalability plays a critical role. The number of Network 
Interfaces used to attach an I/O device to the Data Network determines how much 
of the network’s bandwidth is made available to the device. The more ports an 
interface has into the Data Network, the greater its potential bandwidth. 

An I/O interface with a single Network Interface provides a nominal bandwidth
of 20 Mbytes/sec across the Data Network. This capacity can easily accommodate
low- and medium-speed I/O devices. Interfaces for high-performance
peripherals are implemented with as many Network Interfaces as are needed to
support the transfer rates required by the particular device. The Scalable Disk
Array, for example, has one Network Interface for each Disk Storage Node;
when a Storage Node is added for increased capacity, a Network Interface is
added also, increasing the device’s aggregate transfer rate. The CM5-HIPPI,
which can provide 160 Mbytes/sec aggregate bandwidth, has eight Network Interfaces.
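
The arithmetic behind these figures is simple enough to sketch in Python. The
constant comes from the nominal 20 Mbytes/sec quoted above; the function names
are hypothetical, and the calculation ignores any protocol or device overhead.

```python
import math

NI_BANDWIDTH_MBYTES_PER_SEC = 20  # nominal figure for one Network Interface


def aggregate_bandwidth(network_interfaces: int) -> int:
    """Nominal Data Network bandwidth available to an I/O interface."""
    return network_interfaces * NI_BANDWIDTH_MBYTES_PER_SEC


def interfaces_needed(required_mbytes_per_sec: float) -> int:
    """Smallest number of Network Interfaces that covers a device's transfer rate."""
    return math.ceil(required_mbytes_per_sec / NI_BANDWIDTH_MBYTES_PER_SEC)


print(aggregate_bandwidth(8))    # 160, the CM5-HIPPI figure
print(interfaces_needed(160))    # 8
```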

Figure 41. CM-5 I/O subsystem block diagram.


Each I/O interface requires a control processor to act as its file server and supervise
all I/O operations for that device. For the Scalable Disk Array, the processor
is known as an I/O control processor (IOCP). The IOCP has the same capabilities
as other CM-5 control processors, and so may be used for other functions, as well
as for file server functions. For the Integrated Tape System, the Connection
Machine HIPPI devices, and the CMIO bus devices, the file server is embedded
within the device itself.


20.2 File System Environment 

The CM-5 system can access three file systems: SFS, CMFS, and UNIX. SFS (Scalable
File System) and CMFS (Connection Machine File System), both
proprietary to Thinking Machines Corporation, exploit the great speed and massive
storage capabilities of the Connection Machine I/O systems. The UNIX file
system further enhances system usability. All three file systems organize files
into directories, use pathnames to identify them, and treat all I/O devices as files.
Following is a brief description of each file system:

■ SFS is NFS-mounted on an IOCP and manages the files stored on a Scalable
Disk Array or Integrated Tape System. The SFS file system is a fully
compatible enhancement of the UNIX file system, with extensions to support
parallel I/O and much larger files than most UNIX implementations
can accommodate. A CM-5 program can access the SFS file system via CM
Fortran, C*, the CMMD library, a subset of the CMFS library and the CMFS
commands, and standard UNIX routines and commands.

From a user’s perspective, the SFS file system behaves like a UNIX file
system: for example, it stores each file’s data in canonical serial order.

■ CMFS is a UNIX-like file system that can reside on the CMIO-bus data-storage
devices. Like the SFS file system, the CMFS file system has
extensions to support parallel I/O and very large files. A CM-5 program 
can access the CMFS file system via the same interfaces that are used for 
the SFS file system (to use the UNIX commands and calls, however, the 
CMFS file system must be NFS-mounted). 

From a user’s perspective, although the CMFS file system is similar to the 
UNIX file system, there are some differences in its appearance and behavior.
For example, there are special environment variables that apply only
to the CMFS file system. 

■ The standard implementation of the UNIX file system can reside on all
CM-5 control processors and on all other serial computers in Connection
Machine systems.

Each file system is completely separate from all others, having a separate directory
tree and its own current working directory.

Regardless of which file system is being accessed, all I/O transactions are modeled
as reads and writes to files. UNIX-style open() and close() requests go
to the file server and thus are independent of the file system data storage implementation.
Requests for other file system operations, such as reading and writing
files, also go to the file server, which then directs the transfer of data. In these
cases, data may be transferred in parallel directly from the source to the destination,
without passing through the file server. For the SFS file system, data and
control information travel separately through the Data Network. (See Figure 42.)
For the CMFS file system, data travels through the Data Network, while control
information travels over the Ethernet. For the UNIX file system, data and control
information both travel over the Ethernet.

Figure 42. Independent control and data paths through the Data Network.


When an application program requires I/O services, the partition on which the 
application is running initiates the file transfer with an appropriate read or write 
command. It directs the I/O request to the appropriate file server, which assumes 
control of the transfer. 

For a read() operation, the file is retrieved from the I/O device, encapsulated 
in message packets, and sent through the Data Network to the partition that 
requested the data. File order information embedded in the message packets 
enables the receiving partition to arrange the file data in correct sequence within 
each processor. A write() operation is similar but reverses the flow of data.

Different versions of read() and write() are used, depending on whether an
application is running on a single processor or on a set of parallel processing
nodes. A serial application uses the conventional UNIX read() and write()
commands. Parallel applications use parallel read and parallel write routines.

Multiple I/O devices may be logically ganged for striped operation as a single file 
system. The SFS file system automatically routes data between requesting processors
and individual I/O devices so that all striping is transparent. This same
facility makes SFS file structure independent of the number of computational 
processors that read or write the data and transparent to programmers. 
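
The idea that a striped file keeps a single serial order, whatever the number of
devices, can be illustrated with a small Python sketch. The round-robin layout
and the 4-byte stripe unit are assumptions chosen for readability; the text does
not give the actual SFS stripe size or layout, and the function names are
hypothetical.

```python
def stripe(data: bytes, num_devices: int, unit: int = 4) -> list[bytes]:
    """Deal consecutive stripe units of a file out to the devices in rotation."""
    devices = [bytearray() for _ in range(num_devices)]
    for i in range(0, len(data), unit):
        devices[(i // unit) % num_devices] += data[i:i + unit]
    return [bytes(d) for d in devices]


def unstripe(devices: list[bytes], unit: int = 4) -> bytes:
    """Reassemble the file in serial order by visiting the devices in rotation."""
    total = sum(len(d) for d in devices)
    offsets = [0] * len(devices)
    out = bytearray()
    d = 0
    while len(out) < total:
        out += devices[d][offsets[d]:offsets[d] + unit]
        offsets[d] += unit
        d = (d + 1) % len(devices)
    return bytes(out)


text = b"serial order is preserved"
for n in (1, 2, 3, 5):
    assert unstripe(stripe(text, n)) == text   # same file for any device count
print(stripe(text, 3))
```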


20.3 CM-5 I/O Interfaces and Device Implementation

This section describes the SDA, ITS, and CM5-HIPPI, the key elements of the 
CM-5 I/O subsystem implementation. The CMIO bus peripherals and the two 
standard bus interfaces (SVME and SBA) are described in Section 20.4. 


20.3.1 Scalable Disk Array 

The Scalable Disk Array (SDA) is an extremely high performance, highly
expandable RAID-3 disk storage system composed of Disk Storage Nodes packaged
within CM-5 cabinetry. The basic Disk Storage Node, which provides 9.2
Gbytes of storage, a peak bandwidth over 17 Mbytes/sec, and 25 Mips of processing
power, comprises a controller built on a SPARC processor, a Network
Interface, a large disk buffer, four advanced SCSI controllers, and eight 3.5" hard
disk drives. (See Figure 43.) The drives are mounted in removable modules that
facilitate installation and removal during servicing operations. These parts are
widely available and represent mature high-volume technologies, thereby contributing
to the high reliability of the SDA system. Additional custom hardware
is provided to augment the transfer of data directly between the disks, the buffer,
and the Data Network.

The SDA Disk Storage Nodes — analogous to the computational processing 
nodes offered with the CM-5 architecture — are directly connected to the CM-5’s 
Data Network. The direct connection to the Data Network enables each Disk 
Storage Node to contribute not only to storage capacity but also to I/O performance;
the number of the Disk Storage Nodes in an SDA system can be increased
or decreased, thereby achieving an I/O system matched to the performance and 
capacity needs of the CM-5 applications. 

For example, a single Disk Array Module comprising 3 Disk Storage Nodes provides
25 Gbytes of storage at I/O bandwidths of up to 33 Mbytes/sec sustained.
Adding one more Disk Storage Node increases capacity to over 33 Gbytes and 
I/O bandwidth to 44 Mbytes/sec sustained. 

The number of Disk Array Modules that can be installed in a CM-5 system is 
limited only by the number of address spaces the system contains. A full cabinet’s
worth of SDA (eight Disk Array Modules) uses 256 Data Network
addresses and provides 200 Gbytes of storage capacity.
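
The capacity and bandwidth figures quoted in this section are mutually
consistent, as the following Python sketch suggests. It rests on two
assumptions that the text does not state outright: that sustained bandwidth is
11 Mbytes/sec per Disk Storage Node (inferred from 33 Mbytes/sec for three nodes
and 44 Mbytes/sec for four), and that usable capacity is the raw capacity scaled
by the share of data disks, using the 22 data disks, 1 parity disk, and 1 spare
disk per Disk Array Module described under Availability and Serviceability. All
names are hypothetical.

```python
NODE_RAW_GBYTES = 9.2            # per Disk Storage Node (from the text)
DISKS_PER_NODE = 8               # from the text
NODE_SUSTAINED_MBYTES_SEC = 11   # assumption: inferred from 33/3 and 44/4


def sustained_bandwidth(nodes: int) -> int:
    """Sustained I/O bandwidth in Mbytes/sec for a given number of nodes."""
    return nodes * NODE_SUSTAINED_MBYTES_SEC


def module_capacity_gbytes(nodes: int = 3, parity_disks: int = 1,
                           spare_disks: int = 1) -> float:
    """Usable capacity of one Disk Array Module, excluding parity and spare disks."""
    disks = nodes * DISKS_PER_NODE
    data_disks = disks - parity_disks - spare_disks
    return NODE_RAW_GBYTES * nodes * data_disks / disks


print(sustained_bandwidth(3), sustained_bandwidth(4))   # 33 44
print(round(module_capacity_gbytes(), 1))               # about 25 Gbytes
print(round(8 * module_capacity_gbytes()))              # about 200 Gbytes per cabinet
```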

Figure 43. Block diagram of a Disk Storage Node.


Data Storage 

Data stored on the Disk Storage Nodes is always available in serial order. The 
CM-5 stripes the data across all the Disk Storage Nodes in the system, so that 
each node contributes to the overall transfer rate available to the user. All disks 
in the SDA act together, in conjunction with the operating system, to transfer data 
simultaneously. In fulfilling a read request, for example, the operating system 
knows how an array is spread across the processors, and automatically distributes 
the data coming from the SDA to the correct locations. From the application’s 
perspective, the SDA appears to be a single, high-capacity, high-performance 
disk. 

Special-purpose hardware on the Disk Storage Node controller assists the operating
system in handling data-ordering issues, providing a seamless mechanism for
moving data between the Disk Storage Nodes, partitions of processing nodes,
other CM-5 I/O devices, other serial computers, and I/O devices connected to
them. This combination of hardware and software support relieves the applications
programmer from the burden of dealing with data ordering, without
compromising performance.


Availability and Serviceability 

In addition to 22 data-storage disks, a Disk Array Module contains 1 parity disk
and 1 spare disk. The parity disk stores redundant information, used to recreate
the data should a disk fail, that is automatically generated by the operating system
as a simple parity summing operation:

■ On disk writes, parity is generated in the processing nodes and transferred
with the data to the parity disk.

■ On disk reads, all the data, including that from the parity disk, is sent to
the processing nodes, which calculate and check the parity.

In the event of a drive failure, a sparing and healing operation is performed.
Sparing is a software procedure that logically replaces the failed drive with the
SDA’s spare drive (provided for that purpose); healing is the process that reconstructs
the corrupted data using the parity information and stores it on the spare
drive. The entire operation takes about one hour; each phase is signaled
by the LEDs.
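
Parity generation and healing can be illustrated with a minimal Python sketch.
It assumes that the "parity summing" mentioned above is a bytewise exclusive-OR
across the data disks, as in conventional RAID-3; the text does not describe the
SDA's actual on-disk format, and the function names are hypothetical.

```python
from functools import reduce


def parity(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks, one block per disk."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def heal(surviving_blocks: list[bytes], parity_block: bytes) -> bytes:
    """Rebuild the one missing block from the survivors and the parity block."""
    return parity(surviving_blocks + [parity_block])


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]   # three data disks
p = parity(data)                                  # written to the parity disk

# Suppose the second disk fails: its contents are recovered onto the spare.
rebuilt = heal([data[0], data[2]], p)
assert rebuilt == data[1]

# On a read, XOR over all data plus parity is zero when nothing is corrupted.
assert parity(data + [p]) == b"\x00\x00"
```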

The sparing and healing utilities, as well as utilities that facilitate other administrative
tasks such as running SDA diagnostics and controlling the error-logging
and -reporting mechanism, are provided through the SDA Command Center
(SDACC). The SDACC is a menu-driven program that runs on the SDA’s IOCP,
allowing the system administrator to maintain the SDA in a directed manner with
minimal down-time and no guess-work.


Failure Detection 

Essential to maintaining high availability is the detection of failures in the data 
paths. Failures are detected by parity checking circuits located in all the major 
data paths and in all the controller and processor buffers. In addition, internal to
the disk drives, extensive use of error-correcting codes provides for complete
correction of bit-error bursts up to 48 bits in length, and detection of error bursts 
as long as 120 bits. 


20.3.2 Integrated Tape System

The Integrated Tape System (ITS) is an industry standard-compliant tape-controller
system that, like the SDA, is integrated into the CM-5 cabinetry, with direct
connections to the Data Network. The ITS offers the same unlimited scalability 
as the other parts of the CM-5 system: dozens or even hundreds of ITS modules 
can be connected to a single CM-5, each accommodating, for example, two 
IBM 3480-compatible tape drives or eight 8mm drives. The drives can be located 
as far as 20 meters away from the CM-5 cabinet. 

The ITS tape controller, shown in Figure 44, is built on a SPARC processor, a 
Network Interface, an 8-Mbyte data buffer, and two advanced 16-bit SCSI-2 
channels. The ITS supports industry-standard tape formats such as 3480 square 
tapes (200 or 400 Mbytes per tape) and 8-mm cartridges (5 Gbytes per tape). The 
ITS tape controller can sustain a maximum throughput of 15 Mbytes/sec and can 
burst at up to 20 Mbytes/sec. 



Figure 44. An ITS system. The tape devices can be located 
up to 20 meters away from the CM-5. 

The ITS tape controllers are backplane-compatible with the SDA Disk Storage
Nodes: ITS tape controllers and Disk Storage Nodes can be intermixed. The total 
number of each type of board installed in a system is limited only by available 
address space. 

All software needed to control the ITS tape drives and negotiate the data transfers 
is included in a CMTape software package that includes standard UNIX tape facilities
as well as CM system-specific functionality. Users gain access to the ITS
either through CMTape commands executed from a UNIX shell or through 
CMTape utilities callable from applications. Included in the CMTape package are 
functions for performing: 

■ disk back-ups and restores 

■ label processing of IBM- and ANSI-standard label tapes 

■ common format conversions (for example, between EBCDIC and ASCII) 

All these functions support file sizes greater than the 2-Gbyte restriction imposed 
by UNIX. In addition, applications written for the CM-IOPG (CM I/O processor)
of the Connection Machine models CM-2 and CM-200 can be run on the ITS with
only minimal changes.

The SDA is used to stage data as it is written to the ITS from the CM-5 or read 
from the ITS to the CM-5. Transfers between disk and tape occur directly — no 
CM-5 partition resources are involved in the transfer. 


20.3.3 Connection Machine HIPPI Interfaces 

CM-HIPPI and CM5-HIPPI are bus interface controllers that are designed to transfer
data at a high speed according to the ANSI HIPPI draft standard. The
interfaces are primarily intended to link the CM-5 and its storage devices to other 
supercomputer systems. See Figure 45. 

CM-HIPPI is a complete, integrated system that receives CMFS file system commands
from a CM control processor over an Ethernet cable. The CM-HIPPI
contains a CPU, disk drives, a VMEbus, and HIPPI input and output interface
modules. The modules provide a full-duplex I/O interface between a pair of
external HIPPI buses and a pair of internal buses, one each for incoming and outgoing
data. The internal buses connect the input and output interface modules to
up to eight HIPPI-to-CMIO interface modules via a set of multiplexing switches.
Using an I/O bus adaptor backplane in the CM-5, the switches establish and break
links between a CMIO bus and the HIPPI input and output ports. The peak aggregate
I/O bandwidth of this configuration is 25 Mbytes/sec.

CM-HIPPI supports 

■ HIPPI-PH and HIPPI-FP

■ Multiplexed access to a HIPPI connection 



Figure 45. Typical HIPPI network with Connection Machine systems. 


CM5-HIPPI occupies 1/8 of a standard CM-5 cabinet and provides a 32-bit source
and destination interface to the I/O bulkhead of the machine. The CM5-HIPPI
connects directly to the CM-5 Data Network to provide 160 Mbytes/sec aggregate
bandwidth. There is an on-board diagnostic loopback between source and
destination. In addition to the CM-HIPPI features, CM5-HIPPI supports

■ Users’ HIPPI framing protocols

■ TCP/IP


20.4 CMIO Bus Device Implementation 

For application portability, the family of CM-2 and CM-200 peripherals continues 
to be supported on the CM-5 system. These devices reside on Thinking 
Machines’ proprietary CMIO bus, which is connected to the Data Network via
the IOBA (I/O bus adapter). The CMIO bus peripherals are: 

■ DataVault. This is a high-performance, disk-based mass storage system.
It allows applications running on a CM-5 to access as much as 60 gigabytes
of random access storage per DataVault at I/O bandwidths of up to 25
Mbytes/sec.

■ CM-IOPG. This I/O controller provides 4 ports for connection to SCSI- 
based devices, such as cartridge tape drives. 

Figure 46 shows a CM-5 system that includes a DataVault.



Figure 46. A sample CM I/O system for the CM-5.

The CM-5 also supports I/O for standard VMEbus and SBus devices. The VMEbus
and SBus interfaces allow the CM-5 to be connected to external control
processors and peripheral device controllers that implement these popular
buses. These links make available to CM-5 applications a variety of other
forms of I/O, including framebuffers, tertiary storage devices, and FDDI
networks.


20.4.1 DataVault 

The DataVault system is available in various storage capacities, ranging from 
20 to 60 gigabytes. Each of these configurations is capable of transferring data 
at a sustained rate of 25 Mbytes/sec. 

The basic DataVault storage configuration, used in the 20- and 30-gigabyte
systems, employs an array of 42 5¼-inch Winchester disk drives, of which 39 are
active and 3 are spares. (See Figure 47.) Of the 39 active drives, 32 hold data
and 7 hold error correction code (ECC) bits. The ECC bits allow the DataVault
to correct single-bit errors and to flag multiple-bit errors in each 32-bit
value retrieved from the disks.

The double-capacity DataVault configuration (40 or 60 gigabytes) has 84 drives, 
of which 64 hold data, 14 hold ECC bits, and 6 are spares. 

In all DataVault configurations, each 32-bit data word is spread across 39 data
and ECC drives, one bit per drive. Each 64-bit data chunk received from the
CMIO bus is first split into two 32-bit words. After verifying parity from the
I/O bus, the DataVault controller adds 7 ECC bits to each word and stores the
resulting 39 bits on 39 individual drives. Subsequent failure of any one of the
39 drives does not impair reading of the data, since the ECC data allows any
single-bit error to be detected and corrected for every data word. The ECC data
permits 100% recovery of the contents of a failed disk, allowing a new copy of
this data to be reconstructed and written onto a spare disk. Once this recovery
is complete, the database is healed.
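
The 39-bit layout described above can be illustrated with a short Python
sketch. The text does not say which error-correcting code the DataVault
controller uses. The sketch assumes an extended Hamming (SECDED) code, one
standard construction that fits 32 data bits plus 7 check bits. The function
names, the bit-to-drive assignment, and the order of the two 32-bit halves are
all hypothetical.

```python
# Sketch only: an extended Hamming (SECDED) code with 32 data bits and
# 7 check bits. The actual DataVault code is not given in the text.
PARITY_POSITIONS = (1, 2, 4, 8, 16, 32)
# Positions 1..38 that are not powers of two carry the 32 data bits.
DATA_POSITIONS = [p for p in range(1, 39) if p not in PARITY_POSITIONS]


def split_chunk(chunk64):
    """Split a 64-bit chunk into two 32-bit words (half order is assumed)."""
    return (chunk64 >> 32) & 0xFFFFFFFF, chunk64 & 0xFFFFFFFF


def encode(word):
    """Return 39 bits for a 32-bit word; list index = hypothetical drive."""
    bits = [0] * 39
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    for p in PARITY_POSITIONS:
        # Check bit p covers every position whose index has bit p set.
        bits[p] = sum(bits[q] for q in range(1, 39) if q & p and q != p) % 2
    # Position 0 holds overall parity, which lets double errors be flagged.
    bits[0] = sum(bits[1:]) % 2
    return bits


def decode(bits):
    """Return (word, status): 'ok', 'corrected', or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for p in PARITY_POSITIONS:
        if sum(bits[q] for q in range(1, 39) if q & p) % 2:
            syndrome |= p
    overall_odd = sum(bits) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome <= 38:
        # Single-bit error: the syndrome names the failing position
        # (syndrome 0 means the overall parity bit itself).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of errors, or a syndrome outside the code word.
        return None, "uncorrectable"
    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return word, status
```

Flipping the bit held by any one of the 39 positions models the failure of a
single drive. In this sketch decode() still returns the original word, which is
the property that sparing relies on. Two simultaneous bit errors are reported
as uncorrectable rather than silently miscorrected.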


The File Server 

All I/O transactions in a DataVault I/O interface are controlled by an IOCP
running a file server process. The file server manages the DataVault’s
UNIX-based hierarchical directory structure, handling the allocation of
physical disk space and matching file names and logical read/write requests to
the physical locations of data on the DataVault disks.

Figure 47. Inside the DataVault.

Internally, the file server represents a file as a series of extents, or areas
of contiguous disk surface. Each extent starts at a logical offset within the
file, has a physical disk address, and has a length. This representation allows
logically contiguous segments of a file to be held in arbitrarily large,
physically contiguous blocks on the disks. As a result, positioning of the
read/write heads is more efficient, yielding faster file transfer.
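
The extent representation can be sketched in Python as follows. This is an
illustration, not the file server's actual data structure. The names Extent,
ExtentMap, and locate are hypothetical, and the unit of offsets and lengths
(bytes, words, or blocks) is left open because the text does not specify it.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    logical_offset: int  # where this extent starts within the file
    disk_address: int    # physical address of the extent's first unit
    length: int          # number of contiguous units in the extent


class ExtentMap:
    """Hypothetical per-file table mapping logical offsets to disk addresses."""

    def __init__(self, extents):
        self.extents = sorted(extents, key=lambda e: e.logical_offset)
        self._starts = [e.logical_offset for e in self.extents]

    def locate(self, offset):
        """Return the physical disk address for a logical file offset."""
        # Find the last extent that starts at or before the offset.
        i = bisect_right(self._starts, offset) - 1
        if i < 0:
            raise ValueError("offset precedes the first extent")
        extent = self.extents[i]
        if offset >= extent.logical_offset + extent.length:
            raise ValueError("offset is not covered by any extent")
        return extent.disk_address + (offset - extent.logical_offset)
```

For example, a file built from Extent(0, 5000, 1000) and
Extent(1000, 20000, 4000) resolves logical offset 1000 to disk address 20000.
Every offset within one extent maps to consecutive disk addresses, so a long
sequential transfer needs few head repositionings.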


Writing and Reading Data 

Data transfers move information between a CM-5 partition and the DataVault.
The principal events involved in writing a file to the DataVault are summarized
below. Reading a file from the DataVault into Connection Machine memory is
very similar, but the flow of data is reversed.

A DataVault write operation is typically initiated by a partition manager, which
issues a write command to the IOCP that is acting as the DataVault’s file server.
When the file server receives the logical file request, it translates the request
into a series of physical disk addresses. Assuming that the request parameters
satisfy the necessary validity checks (for example, that there is sufficient
space), the file server returns a message to the requesting partition indicating
the DataVault’s availability. If the request cannot be fulfilled, the file
server returns a failure report instead.
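
The request handling just described can be sketched in Python. This is a
simplified model under stated assumptions, not the file server's real
interface. The function name, the reply strings, and the representation of free
space as a list of (disk address, length) pairs are all hypothetical.

```python
def handle_write_request(free_extents, length):
    """Validate a write request and translate it into disk addresses.

    free_extents is a hypothetical list of (disk_address, length) pairs
    describing unallocated space. The reply is ("ready", allocation), where
    allocation lists (logical_offset, disk_address, length) triples, or
    ("failure", reason) if the request cannot be fulfilled.
    """
    # Validity checks come first; sufficient space is the example in the text.
    if length <= 0:
        return ("failure", "invalid length")
    if sum(size for _, size in free_extents) < length:
        return ("failure", "insufficient space")

    # Translate the logical request into a series of physical extents.
    allocation = []
    logical_offset = 0
    remaining = length
    for disk_address, size in free_extents:
        if remaining == 0:
            break
        if size == 0:
            continue
        take = min(size, remaining)
        allocation.append((logical_offset, disk_address, take))
        logical_offset += take
        remaining -= take
    return ("ready", allocation)
```

In this model, a request for 120 units against free space [(100, 50),
(400, 200)] is answered with "ready" and two extents, while a request for 60
units against [(100, 50)] is answered with a failure report.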

Data from the partition’s memory is moved, via the Data Network, to I/O buffers
in the CMIO interface and then across the CMIO bus to the DataVault. A
microcontroller within the DataVault controls the distribution of data onto the
disk array. State machines at each end of the CMIO bus ensure reliable transfer
of large volumes of data across the bus. Parity checking is performed on all
data as it is received from the CMIO bus to ensure data integrity.

Data being read from the DataVault follows the same path as for writing, but in
reverse order: across the CMIO bus, through the CMIO interface, and across the
Data Network. The data coming off the disks is checked by ECC circuits.
Single-bit errors are corrected and logged, and the data is written with parity
to the CMIO bus. As with write operations, parity checking is performed on data
received from the CMIO bus.


Data Protection 

The transfer status may indicate that a single disk drive is failing and that
the ECC was required to correct data. This will most often be discovered when
the error logs are checked. At that point, the faulty drive can be physically
replaced with an external spare. If the site has no spares in storage, one of
the three (or six) spare drives contained in the DataVault can be logically
substituted for the failing drive.

This logical substitution uses a software procedure, called sparing, that
reconstructs the corrupted data, using the ECC circuits to correct the failing
bit, and stores it on one of the spare drives provided for the purpose. The
sparing program redirects the path followed by the faulty bit from the failing
drive to the spare. Regeneration of this data takes two minutes per gigabyte,
after which the data is again protected against the failure of another drive.

When the failed drive is physically replaced, the files are reconstructed using
the same technique as is used when sparing the failed drive.


20.4.2 Standard Protocol I/O Interfaces

Two CM-5 standard bus interfaces, called SVME and SBA, enable the CM-5
operating system to access external VMEbus or SBus computers and their
associated I/O resources using standard communications protocols. These I/O
paths link the CM-5 to external networks of computing and I/O server resources.

The interfaces are connected by cable to a VMEbus- or SBus-based external
control processor, which manages the file system and other I/O functions. An
adapter board, installed in the control processor, provides an interface
between the VMEbus or SBus and the CM-5 Data Network (and Control Network).

The SBus interface consists of an adapter board, called the SBA, that plugs into 
the SBus of an external control processor. This adapter board is connected by 
cable to an interface module, called the control processor interface (CPI), that is 
plugged into the CM-5 Data Network and Control Network. The external control 
processor, running file server code, serves as the file system processor. This 
arrangement allows applications running on the CM-5 to exploit any I/O 
resources, such as a tape storage system, that are attached to the external control
processor’s SBus. 

In a similar fashion, the VMEbus interface uses a VME adapter board, called the 
SVME, to connect a control processor’s VMEbus to the CM-5 Data Network and 
Control Network. Apart from this difference, the VMEbus interface employs the 
same design features as the SBus interface.